Preface


This volume collects selected papers presented at the 11th biennial meeting of 
the Classification and Data Analysis Group (CLADAG) of the Società Italiana di 
Statistica, held in Milan, September 13-15, 2017. 

The program of the conference included 142 presentations, organized in 3 
plenary talks, 21 invited sessions, 18 contributed sessions, and a poster session. We 
wish to express our gratitude to the authors, whose enthusiastic participation made 
the meeting possible. The conference provided a vibrant international forum for 
discussion and a mutual exchange of knowledge, thanks to the 163 attendees and 
authors coming from several European countries, such as Austria, Denmark, France, 
Germany, Great Britain, Ireland, the Netherlands, Norway, Poland, Spain, and 
Switzerland, as well as from the United States and Japan. The Scientific Committee 
of the Conference was chaired by Francesco Mola, and Francesca Greselin was the 
Chairperson of the Local Organizing Committee. 

The topics of Plenary and Invited Sessions were carefully chosen by the 
Scientific Committee in view of the CLADAG mission: to promote methodological, 
computational, and applied research in classification, data analysis, and multivariate 
statistics. We thank all the organizers of the sessions for inviting renowned speakers. 
We extend our gratitude to all the chairpersons and active participants, whose 
interesting comments and suggestions made the conference a truly motivating event. 

The 20 manuscripts included in the present volume were selected, through a blind 
review process, from among those presented at the conference and later submitted for 
publication in the Springer book series. We are greatly indebted to the referees 
(at least two scholars were involved for each paper) for the time and effort they spent 
in such a careful review. 

The volume is divided into five parts as follows: Clustering and Classification, 
Exploratory Data Analysis, Statistical Modeling, Graphical Models, and Big Data 
Analysis. 

The first part, Clustering and Classification, contains methodologically oriented 
papers. The paper by Fordellone and Vichi presents the combined usage of 
unsupervised classification with supervised methods to enhance the assessment and 
the interpretation of the obtained partition; Rainey, Tortora, and Palumbo are the 
authors of the second work, which introduces a parametric version of probabilistic 
distance clustering based on the multivariate Gaussian and Student’s t 
distributions. Cappozzo and Greselin deal with an interesting application to wine 
authenticity studies based on robust clustering methods, where a mixture of 
Gaussian factors is used to ascertain varietal genuineness and distinguish potentially 
doctored food. Simulation results are presented by Alfò, Nieddu, and Vitiello 
to study the performance of the cluster-weighted beta regression in a variety of 
empirical settings. The results show that the model captures individual-specific 
unobserved heterogeneity and its link with observed covariates and indicate some 
shortcomings related to the scale of the observed quantities. The last paper of this 
part, by Ranalli and Rocci, presents an overview of a recent model-based approach 
to cluster ordinal variables. The aim is to extend the proposal to the case where noise 
dimensions or variables are present, and to generalize the model to mixed-type data. 

The second part, devoted to Exploratory Data Analysis, contains a first paper by 
Bove, Ruta, and Mastandrea where multidimensional scaling and unfolding make it 
easy to detect preference order, size of asymmetry, and relationships between 
subjects and stimuli coming from the curvature of architectural facades. In the 
second paper, De Stefano, Vitale, and Zaccarin submit exploratory results obtained 
by adopting two well-known community detection methods and a new proposal, 
aiming at discovering groups of scientists in the coauthorship network of Italian 
academic statisticians. In their study, Okada and Tsurumi extend asymmetric 
multidimensional scaling to analyze differences among consumers and to show 
how each consumer or a group of consumers relates to brands in brand switching. 
In the fourth paper, Solaro addresses the problem of comparing the results obtained 
with different imputation methods, so critical in many fields of application, such 
as cardiovascular studies in the medical context. She also assesses the quality of 
imputation through dissimilarity profile analysis. 

The third part refers to Statistical Modeling. The first paper is by Altimari, 
Balzano, and Zezza, who introduce an extended version of the Economic Vulnerability 
Index adopted by the United Nations. Using the partial least squares approach 
to structural equation modeling, an estimate of the extended index is obtained from 
a panel of 98 countries over 19 years. Ascari, Migliorati, and Ongaro 
present two Bayesian procedures—both based on Gibbs sampling—to estimate 
the parameters of the flexible Dirichlet. This distribution is particularly suited for 
compositional data, thanks to its mixture structure and to the additional parameters 
that allow for a more flexible modeling of the covariance matrix. Fiori and 
Motta contribute a new stochastic model of firm dynamics that leads to a Dagum 
distribution for the size of business firms operating in a given industry. The model 
relies on a stochastic growth process that was originally introduced in the context 
of income inequality studies and sheds new light upon the connections between 
growth dynamics and the meaning of parameters that appear in the steady-state 
distribution of firm size. A mixture of Mincer’s models with concomitant variables 
is proposed by Mazza, Battisti, Ingrassia, and Punzo. The new model provides 
a flexible generalization of the Mincer model, a breakdown of the population 
into several homogeneous subpopulations, and an explanation of the unobserved 
heterogeneity. The proposal is motivated and illustrated via an application to data 
provided by the Bank of Italy’s Survey of Household Income and Wealth in 2012. 
In his paper, Nakai presents interesting results about the impact of women’s own 
human capital on contribution to household income, by using multinomial logistic 
regression upon data coming from two national social surveys conducted in 1985, 
1995, 2005, and 2015 in Japan. The last paper of this part by Vernizzi and Nakai 
introduces an optimization algorithm that improves the size of the final dataset after 
applying the listwise deletion method. The proposed weighted optimization method 
has been applied, first to some toy examples, and then to the National Longitudinal 
Survey of Youth dataset. 

The fourth part, devoted to Graphical Models, contains a first paper by 
Marella, Vicard, Vitale, and Ababei, which proposes a procedure based on 
non-parametric Bayesian networks to detect and correct measurement error. The novel 
procedure is evaluated on a validation sample associated with the Bank of Italy survey 
on household income and wealth. In the second paper, Musella, Vicard, and Vitale 
address the issue of Bayesian network structural learning for non-paranormal data. 
They propose a modified version of the Copula Grow-Shrink algorithm whose high 
performance is shown through a simulation study. An application to 
Italian energy market data is also provided. The third and last paper of this part is 
by Nicolussi and Cazzaro; it aims to incorporate context-specific independence 
conditions in graphical models. The authors take advantage of the Hierarchical 
Multinomial Marginal parametrization to represent the dependence relationships. 
The proposal is applied to the study of the trend of innovation degree for Italian 
enterprises. 

The last part is devoted to Big Data Analysis. The first paper, by Giuffrida, 
Gozzo, Rinaldi, and Tomaselli, proposes a new approach to address, in the Big 
Data world, the analysis of relational structures to improve actionable 
analytics-driven decision patterns. An application to model online news is provided. In the 
second paper, Pesce, Riccomagno, and Wynn discuss some issues related to the use 
of experimental design to help establish causation in complex models. The question 
of how to remove bias is also considered: various solutions are discussed, including 
randomization. 

We cannot conclude this brief introduction without some further thanks. We 
gratefully acknowledge the Department of Statistics and Quantitative Methods 
and the University of Milano-Bicocca, which strongly supported the CLADAG 
conference. We shared constructive ideas with some colleagues from Milan, 
valuably supported by their Institutions: Piercesare Secchi from Politecnico, Laura 
Deldossi from the Università Cattolica del Sacro Cuore, Pieralda Ferrari from the 
Università di Milano, and Raffaella Piccarreta from the Università Bocconi. To all 
of them goes our most sincere appreciation. Special thanks are due to the members 
of the Local Organizing Committee. They all did a great job! A particular mention 
goes to Mariangela Zenga for her tireless activity and enthusiasm. 
We are grateful to Oracle Corporation, Data Reply, and Fondazione Cariplo. 
They supported the event and made possible the data challenge for Young Cladag. 
They also sponsored the final concert. Maestro Alessandro Arnoldo conducted the 
Milan Chamber Orchestra, and created a delightful moment for all participants. 

Finally, we acknowledge Dr. Veronika Rosteck of Springer-Verlag, Heidelberg, 
for her support and dedication to the making of this volume. 


Milan, Italy Francesca Greselin 
Milan, Italy Laura Deldossi 
Piacenza, Italy Luca Bagnato 
Rome, Italy Maurizio Vichi 


December 2018 
Part I 
Clustering and Classification 


Cluster Weighted Beta Regression: 
A Simulation Study 


Marco Alfò, Luciano Nieddu, and Cecilia Vitiello 


Abstract In several application fields, we have to model a response that takes 
values in a limited range. When these values may be transformed into rates, 
proportions, or concentrations, that is, into continuous values in the unit interval, 
beta regression may be the appropriate choice. In the presence of unobserved 
heterogeneity, for example when the population of interest is composed of different 
subgroups, finite mixtures of beta regression models could be useful. When conditions 
of exogeneity of the covariates set are not met, extended modeling approaches should 
be proposed. For this purpose, we discuss the class of cluster-weighted beta regression models. 


Keywords Beta regression · Finite mixtures · Cluster-weighted regression 


1 Introduction 


Frequently, we are interested in describing the (conditional on covariates) 
distribution of a continuous response variable taking values in a limited interval, which 
could be mapped onto the open unit interval. Such variables can be observed in 
a wide variety of empirical situations, in the form of relative frequencies of an 
event (e.g., number of votes obtained by a candidate out of the total votes cast 
at an election), fractions of a continuous variable (e.g., amount of GDP due to a 
specific economic sector), performance ratings (e.g., student performances when 
compared to the maximum performance attainable), and limited-valued indexes 
(e.g., the relative Gini index). Examples of this framework can be found in medical 
(see, e.g., [12, 21, 22]), education (see, e.g., [3]), and economics [2, 6] research. 
In this context, the use of a standard linear model is not a feasible solution (see, 
e.g., [17] and [14]). A naive solution is to map the response onto the real line (for 
example using a probit transform), so that a standard regression model could be 
used [7]. While this approach is preferable when compared to a linear model for 
the original response variable, it suffers from well-known shortcomings, see [1] 
and [19]. A viable alternative is to use a regression model based on a conditional 
beta distribution for the response $Y$ given the $p$-dimensional vector of covariates 
$x$, that is $Y \mid x \sim \mathcal{B}(p, q)$, $Y \in (0, 1)$, with parameters $p, q > 0$. This 
model has been introduced by Ferrari and Cribari-Neto [8], and extended by Ospina 
and Ferrari [15, 16] to account for those cases where the response is defined over 
the closed interval [0, 1], via (factorizable) mixtures of discrete and continuous 
distributions. For modeling purposes, Ferrari and Cribari-Neto [8] proposed the 
following parameterization: 


$$E(Y) = \mu = \frac{p}{p + q}, \qquad V(Y) = \frac{\mu(1 - \mu)}{1 + \phi}, \tag{1}$$

where $\phi = p + q$ represents the precision parameter. The beta distribution may 
assume different shapes according to different combinations of the $(p, q)$, or $(\mu, \phi)$, 
parameters. Parameter estimates can be obtained using a maximum likelihood (ML) 
approach, see [8]; for this purpose, the R package betareg [5] can be used. In 
some cases, however, individual heterogeneity is only partially accounted for by the 
observed covariates, and continuous/discrete mixed effect beta regression should be 
taken into consideration. In particular, when the omitted covariates can be described 
by a latent variable with a discrete distribution, or the population of interest is 
composed of several subgroups, characterized by different values of regression 
parameters, finite mixtures of beta regressions represent a viable option. This model 
has been introduced in the literature by Grün et al. [10], who designed the R function 
betamix. As in standard finite mixture models, this is based on the (sometimes 
non-explicitly stated) assumption of assignment independence, see [11], which can 
be also considered as a sort of exogeneity of observed covariates with respect to the 
discrete latent variable. Two issues are, in this case, of interest. First, due to the wide 
range of different shapes the beta distribution may take, a further parameterization 
has been proposed by Chen [4], see also [1]. This is motivated by identifiability 
issues and it is based on the subclass of unimodal beta densities. This prevents 
the problem of a U-shaped distribution being fitted by a mixture of two J-shaped 
ones. However, in the present context, we prefer an approach based on modeling the 
mean of the response, rather than its mode. Second, in several empirical cases, the 
omitted covariates described by the latent variables and the observed covariates may 
not be independent, and this may question the reliability of parameter estimates. In 
fact, if we do not account for such a dependence, the estimated impact of observed 
covariates may be either due to a direct effect on the response or to the (mediated) 
effect of omitted covariates. In a former paper, we introduced the class of 
cluster-weighted beta regression models, see [9], to capture individual-specific unobserved 
heterogeneity and its link with observed covariates (see [13] and [18]). 
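
As a quick numerical check of the mean-precision parameterization in (1), the following Python sketch maps a pair $(\mu, \phi)$ to the shape parameters $(p, q)$ and recovers the moments from the standard beta formulas. The function names are illustrative only and do not correspond to any of the R packages mentioned above.

```python
import math

def beta_shapes(mu, phi):
    """Map mean mu in (0, 1) and precision phi > 0 to the shape pair (p, q)."""
    if not (0.0 < mu < 1.0) or phi <= 0.0:
        raise ValueError("need 0 < mu < 1 and phi > 0")
    return mu * phi, (1.0 - mu) * phi

def beta_logpdf(y, mu, phi):
    """Log-density of a beta variable under the mean-precision parameterization."""
    p, q = beta_shapes(mu, phi)
    log_norm = math.lgamma(p + q) - math.lgamma(p) - math.lgamma(q)
    return log_norm + (p - 1.0) * math.log(y) + (q - 1.0) * math.log(1.0 - y)

def beta_moments(p, q):
    """Mean and variance of a beta(p, q) variable from the standard formulas."""
    mean = p / (p + q)
    var = p * q / ((p + q) ** 2 * (p + q + 1.0))
    return mean, var

mu, phi = 0.3, 10.0
p, q = beta_shapes(mu, phi)      # p = mu * phi, q = (1 - mu) * phi
mean, var = beta_moments(p, q)   # var should equal mu * (1 - mu) / (1 + phi)
```

Since $p = \mu\phi$ and $q = (1 - \mu)\phi$, the variance computed from $(p, q)$ coincides with $\mu(1 - \mu)/(1 + \phi)$, which is why larger $\phi$ means a more concentrated response for a given mean.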
In the next section, after discussing the standard finite mixture approach, we 
motivate the proposed approach. The EM algorithm for ML parameter estimation 
is sketched in Sect. 3 while, in Sect. 4, we report the results of a simulation study. 
Some concluding remarks are drawn in Sect. 5. 


2 The Model 


Let (Y, X) be the set including a response variable Y and a covariates vector X; 
let the corresponding population be partitioned into K subpopulations, referred to 
as components, and let $\pi_k(x_i)$ denote the prior probability that unit $i$ belongs to 
component $k = 1, \dots, K$. We associate to subpopulations the indicator vector 
with elements $z_{ik} = 1$ if unit $i$ belongs to component $k$. We further assume that, 
conditional on being in the $k$-th component, $k = 1, \dots, K$, the following model 
holds 


$$Y_i \mid x_i, z_{ik} = 1 \sim \mathcal{B}(p_{ik}, q_{ik}), \qquad p_{ik}, q_{ik} > 0, \tag{2}$$

$$\mu_{ik} = \frac{p_{ik}}{p_{ik} + q_{ik}} = h_1(x_i' \beta_k), \qquad \phi_{ik} = p_{ik} + q_{ik} = h_2(x_i' \gamma_k).$$

That is, each component is associated with possibly varying parameter vectors, $\beta_k$ 
and $\gamma_k$, in the models for the mean and the precision parameter of that component. 
However, the component indicators are not observed and, for the purpose of estimation 
in an ML framework, we need to define the observed data joint density: 


$$f(y_i \mid x_i) = \sum_{k=1}^{K} f(y_i \mid x_i, z_{ik} = 1)\, \pi_k(x_i). \tag{3}$$


While the first term in the sum denotes the (conditional) beta density, the term 
$\pi_k(x_i)$, $k = 1, \dots, K$, has been previously defined. A usual assumption is that 
$\pi_k(x_i) = \pi_k$ $\forall k$, known as assignment independence. When the independence is not 
met, this assumption may lead to a severe bias in model parameter estimates. This 
can be seen by looking at the graph in Fig. 1; here, $x_i$ has a non-zero 
impact on $Y_i$ either directly or indirectly through $z_{ik}$. By adopting an independence 
assumption, we use a misspecified model. As a result, we remove the gray dashed 
edge from $x_i$ to $z_{ik}$, and inflate the impact of the covariates on the response $Y_i$. 

Fig. 1 A simple path diagram. (a) True model. (b) Misspecified model 

In the case $x_i$ and $z_{ik}$ are not independent, we need to either model the impact of 
covariates on the component indicator, or go for the specification of the marginal 
density of the couple (Y, X). In the former case, we cannot distinguish between the 
direct and the indirect impact of X; in fact, there is no way, as opposed to the case 
when repeated observations are available, to test whether the observed 
covariates act on Y directly or on Y through Z. For the latter, we may turn to a more 
flexible family of mixture models, the cluster-weighted models (CWM), which can 
be obtained by relaxing the assignment independence hypothesis. This approach 
has been introduced by Gershenfeld [9] as a model-based clustering approach for 
the couple (Y, X). See [13] and [18] for extensions. The model is: 


$$f(y_i, x_i) = \sum_{k=1}^{K} f(y_i \mid x_i, z_{ik} = 1)\, g(x_i \mid z_{ik} = 1)\, \pi_k. \tag{4}$$


Here, $f(y_i \mid x_i, z_{ik} = 1)$ is the density of the response conditional on the set of 
covariates and the component the unit belongs to, and $g(x_i \mid z_{ik} = 1)$ denotes the 
distribution of the covariates in the specific component $k = 1, \dots, K$. 
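
To make the structure of (4) concrete, the sketch below evaluates the cluster-weighted density for a single unit with one Gaussian covariate and derives the posterior component probabilities from the per-component terms. It is only an illustration under stated assumptions: the logit link for the mean, the log link for the precision, the univariate covariate, and all parameter values are hypothetical choices, not taken from the text.

```python
import math

def beta_logpdf(y, mu, phi):
    """Beta log-density in the mean-precision parameterization."""
    p, q = mu * phi, (1.0 - mu) * phi
    return (math.lgamma(p + q) - math.lgamma(p) - math.lgamma(q)
            + (p - 1.0) * math.log(y) + (q - 1.0) * math.log(1.0 - y))

def normal_logpdf(x, mean, var):
    """Univariate Gaussian log-density."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def cwm_terms(y, x, components):
    """Per-component terms pi_k * f(y | x, k) * g(x | k) of the mixture in (4)."""
    terms = []
    for c in components:
        eta = c["beta"][0] + c["beta"][1] * x
        mu = 1.0 / (1.0 + math.exp(-eta))                   # assumed logit link
        phi = math.exp(c["gamma"][0] + c["gamma"][1] * x)   # assumed log link
        log_term = (math.log(c["pi"]) + beta_logpdf(y, mu, phi)
                    + normal_logpdf(x, c["nu"], c["var"]))
        terms.append(math.exp(log_term))
    return terms

def cwm_density(y, x, components):
    """Joint density f(y, x) as the sum of the component terms."""
    return sum(cwm_terms(y, x, components))

# Two hypothetical components whose covariate distributions differ in location.
components = [
    {"pi": 0.5, "beta": (0.0, 1.0), "gamma": (2.0, 0.0), "nu": -1.0, "var": 0.25},
    {"pi": 0.5, "beta": (0.5, -1.0), "gamma": (2.0, 0.0), "nu": 1.0, "var": 0.25},
]
terms = cwm_terms(0.4, -0.8, components)
posterior = [t / sum(terms) for t in terms]
```

Because the covariate density $g(x_i \mid z_{ik} = 1)$ enters each term, a unit whose covariate lies close to one component's covariate mean is pulled towards that component even before the response is considered; under assignment independence this factor would be the same for all components and would cancel.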


3 ML Parameter Estimation 


When the covariates are continuous, it is customary to adopt a component-specific 
multivariate Gaussian density, that is $X \mid z_{ik} = 1 \sim MVN(\nu_k, \Sigma_k)$. With 
non-continuous or mixed-type covariates more general models should be used 
to describe the component-specific covariates distribution [13]. A (conditional) beta 
model: 


$$g_1(\mu_{ik}) = \eta_{ik} = x_i' \beta_k, \qquad g_2(\phi_{ik}) = x_i' \gamma_k \tag{5}$$


is used for the response, while the link functions $g_1(\cdot)$ and $g_2(\cdot)$ are monotone 
and twice differentiable. Given these modeling assumptions, the observed data 
likelihood is: 


$$L(\psi) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k\, f(y_i \mid x_i, \beta_k, \gamma_k)\, g(x_i \mid \nu_k, \Sigma_k),$$

where $\theta_k = (\beta_k, \gamma_k)$ denotes the vector of regression parameters, while 
$\omega_k = (\nu_k, \Sigma_k)$, $k = 1, \dots, K$, represents the parameters for the 
conditional Gaussian distribution for the set of observed covariates. Last, 
$\psi = (\theta_1, \dots, \theta_K, \omega_1, \dots, \omega_K, \pi_1, \dots, \pi_{K-1})$ denotes the global set of model 
parameters. Let $\ell_c(\psi)$ denote the log-likelihood function for the complete data 
$\{y_i, x_i, z_i\}_{i=1,\dots,n}$. For fixed $K$, at the $r$-th iteration of the EM algorithm, $r = 1, \dots$, 
the E-step computes the expected value of the log-likelihood function for complete 
data, conditional on the observed data and the current parameter estimates $\hat{\psi}^{(r-1)}$. 
This is referred to as $Q(\psi \mid \hat{\psi}^{(r-1)})$. In the M-step of the algorithm, this function 
is maximized with respect to $\psi$. The algorithm alternates the two steps until 
convergence, defined in terms of the norm of the difference between two subsequent 
values of model parameters: 


E-step Compute $Q(\psi \mid \hat{\psi}^{(r-1)}) = E_{\hat{\psi}^{(r-1)}}(\ell_c(\psi) \mid x, y)$ 
M-step Maximize $Q(\psi \mid \hat{\psi}^{(r-1)})$ w.r.t. $\psi$ to obtain updated estimates $\hat{\psi}^{(r)}$ 














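In the E-step the missing component indicator $z_{ik}$ is replaced by the corresponding 
conditional expectation 

$$w_{ik}^{(r)} = \frac{\pi_k^{(r-1)}\, f(y_i \mid x_i, \theta_k^{(r-1)})\, g(x_i \mid \omega_k^{(r-1)})}{\sum_{l=1}^{K} \pi_l^{(r-1)}\, f(y_i \mid x_i, \theta_l^{(r-1)})\, g(x_i \mid \omega_l^{(r-1)})},$$

that represents the posterior probability that the $i$-th unit comes from the $k$-th 
component, $i = 1, \dots, n$, $k = 1, \dots, K$. ML equations for the beta model and 
the Gaussian density parameters are given by the following expressions 

$$\frac{\partial Q}{\partial \theta_k} = \sum_{i=1}^{n} w_{ik}^{(r)} \frac{\partial \log f(y_i \mid x_i, \theta_k)}{\partial \theta_k} = 0 \quad \text{and} \quad \frac{\partial Q}{\partial \omega_k} = \sum_{i=1}^{n} w_{ik}^{(r)} \frac{\partial \log g(x_i \mid \omega_k)}{\partial \omega_k} = 0.$$

Both expressions are weighted score equations with weights given by $w_{ik}^{(r)}$. The 
updated estimates for the Gaussian density parameters are available in closed form: 

$$\hat{\nu}_k^{(r)} = \frac{\sum_{i=1}^{n} x_i\, w_{ik}^{(r)}}{\sum_{i=1}^{n} w_{ik}^{(r)}}, \qquad \hat{\Sigma}_k^{(r)} = \frac{\sum_{i=1}^{n} w_{ik}^{(r)} (x_i - \hat{\nu}_k^{(r)})(x_i - \hat{\nu}_k^{(r)})'}{\sum_{i=1}^{n} w_{ik}^{(r)}},$$

while $\hat{\pi}_k^{(r)} = \sum_{i=1}^{n} w_{ik}^{(r)} / n$, a well-known result in finite mixtures. 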
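
The closed-form part of the M-step can be sketched in a few lines of plain Python. The function below takes the covariate vectors and the posterior weights of one component and returns the updated mixing proportion, mean vector, and covariance matrix; the name `m_step_gaussian` and the toy data are illustrative assumptions. The beta regression parameters are not covered here, since their weighted score equations have no closed-form solution and require a numerical optimizer.

```python
def m_step_gaussian(xs, weights):
    """Closed-form weighted updates for one mixture component.

    xs      : list of covariate vectors (lists of floats), one per unit
    weights : posterior weights w_ik of this component, one per unit
    Returns (pi_k, nu_k, sigma_k).
    """
    n, d = len(xs), len(xs[0])
    total = sum(weights)
    pi_k = total / n
    # Weighted mean of the covariates.
    nu_k = [sum(w * x[j] for w, x in zip(weights, xs)) / total for j in range(d)]
    # Weighted covariance around the updated mean.
    sigma_k = [[sum(w * (x[a] - nu_k[a]) * (x[b] - nu_k[b])
                    for w, x in zip(weights, xs)) / total
                for b in range(d)]
               for a in range(d)]
    return pi_k, nu_k, sigma_k

xs = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
weights = [1.0, 1.0, 0.0, 0.0]   # hard assignments, a degenerate special case
pi_k, nu_k, sigma_k = m_step_gaussian(xs, weights)
```

With hard 0/1 weights the updates reduce to the sample proportion, mean, and (maximum likelihood) covariance of the units assigned to the component, which is a convenient sanity check for an implementation.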


4 Simulation Study 


To study the proposed model performance in a variety of empirical settings, we 
defined the following simulation experiment. The aim is to evaluate the model’s 
ability to recover the true parameter values and the coverage of the corresponding 
confidence intervals. Further, we evaluate whether the estimated component 
memberships recover the true partition of the data. To make the data easy to interpret 
and visualize, the covariates $X_i = (X_{i1}, X_{i2})$ have been drawn from a multivariate 
Gaussian density $X_i \mid z_{ik} = 1 \sim MVN(\nu_k, \Sigma)$ and the number of groups has been 
set to $K = 3$ where 

$$\nu_k = \begin{cases} (0, 0) & k = 1 \\ (-1, +1) & k = 2 \\ (+1, +2) & k = 3 \end{cases}$$

For each component we have drawn $n_k = 700$, $k = 1, 2, 3$, units resulting in 
a total sample size $n = 2100$. Two possible scenarios have been considered for the 
covariance matrix, namely: 


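$$\text{A:} \quad \Sigma = \begin{pmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{pmatrix}, \qquad \sigma^2 = \{0.09 + a \cdot 0.02\}, \quad a = 1, \dots, 10$$

$$\text{B:} \quad \Sigma = b \begin{pmatrix} \sigma_1^2 & \rho \sigma_1 \sigma_2 \\ \rho \sigma_1 \sigma_2 & \sigma_2^2 \end{pmatrix}, \qquad \sigma_1 \in \{1, 2\},\ \sigma_2 \in \{1, 2\},\ \rho \in \{0, 0.2, 0.8\}$$

where $b$ is a multiplicative factor assuming values in $\{0.09, 0.11, 0.13\}$. Values for 
the response variable $Y_i \mid z_{ik} = 1$ have been drawn from a beta distribution with 
parameters $(\mu_{ik}, \phi_{ik})$ which have been chosen, conditionally on the component 
membership, according to the linear predictors 

$$g_1(\mu_{ik}) = \beta_{0k} + \beta_{1k} x_{i1} + \beta_{2k} x_{i2}, \qquad g_2(\phi_{ik}) = \gamma_{0k} + \gamma_{1k} x_{i1} + \gamma_{2k} x_{i2};$$

the coefficients of the linear predictors have been set to: 

$$(\beta_{0k}, \beta_{1k}, \beta_{2k}) = \begin{cases} (0, -0.3, 0) & k = 1 \\ (-0.5, -0.9, 0.4) & k = 2 \\ (0.5, 0.4, -0.3) & k = 3 \end{cases} \qquad (\gamma_{0k}, \gamma_{1k}, \gamma_{2k}) = \begin{cases} (0, 0, -1) & k = 1 \\ (0.2, 0, 0.5) & k = 2 \\ (0.5, 0, 0.3) & k = 3 \end{cases}$$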
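
A data-generating step along the lines of Scenario A can be sketched as follows. This is not the authors' simulation code: the logit link for the mean and the log link for the precision are assumptions (the text does not name the links), the coefficient values are as read from the design above, and the function name `simulate_scenario_a` is hypothetical.

```python
import math
import random

# Component-specific covariate means and regression coefficients (read from the design).
NU = [(0.0, 0.0), (-1.0, 1.0), (1.0, 2.0)]
BETA = [(0.0, -0.3, 0.0), (-0.5, -0.9, 0.4), (0.5, 0.4, -0.3)]
GAMMA = [(0.0, 0.0, -1.0), (0.2, 0.0, 0.5), (0.5, 0.0, 0.3)]

def simulate_scenario_a(sigma2, n_per_component=700, seed=0):
    """Draw (component, x1, x2, y) tuples with covariance sigma2 * I for the covariates."""
    rng = random.Random(seed)
    sd = math.sqrt(sigma2)
    data = []
    for k in range(3):
        for _ in range(n_per_component):
            # Uncorrelated Gaussian covariates around the component mean.
            x1 = rng.gauss(NU[k][0], sd)
            x2 = rng.gauss(NU[k][1], sd)
            eta = BETA[k][0] + BETA[k][1] * x1 + BETA[k][2] * x2
            zeta = GAMMA[k][0] + GAMMA[k][1] * x1 + GAMMA[k][2] * x2
            mu = 1.0 / (1.0 + math.exp(-eta))   # assumed logit link for the mean
            phi = math.exp(zeta)                # assumed log link for the precision
            # Shape parameters follow from the mean-precision parameterization.
            y = rng.betavariate(mu * phi, (1.0 - mu) * phi)
            data.append((k, x1, x2, y))
    return data

sample = simulate_scenario_a(sigma2=0.09)
```

Repeating the call with different seeds and values of `sigma2` would give the replicated samples on which a fitted model can then be evaluated.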


For each parameter configuration, 200 samples have been considered. For 
estimation purposes we considered $K = 1, \dots, 5$ components and the best solution 
was chosen according to BIC. We report in Table 1 the values of the Rand index 
[20] between the original membership and the classification obtained using the 
CWR approach. As expected, with increasing noise, the quality of the classification 
deteriorates, decreasing from 0.98 for $\sigma^2 = 0.09$ to 0.80 for $\sigma^2 = 0.29$. As shown 
in the standard deviation column of Table 1, the variability of the Rand index 
distribution tends to increase as well. Overall, the coherence between the original 
classification and the one obtained using the proposed approach is quite satisfactory, 
especially considering that our aim was to derive accurate estimates. 
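
For reference, the (unadjusted) Rand index between two partitions is the share of unit pairs on which they agree, that is, pairs placed together in both or apart in both. A minimal Python sketch, with illustrative labels, is:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Share of unit pairs treated the same way by the two partitions."""
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must cover the same units")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += (same_a == same_b)   # together in both, or apart in both
        total += 1
    return agree / total

true_labels = [0, 0, 1, 1, 2, 2]
estimated = [1, 1, 0, 0, 2, 0]      # label names differ; one unit is misplaced
ri = rand_index(true_labels, estimated)
```

The index depends only on the induced partitions, so relabeling the estimated components does not change its value; this matters here because mixture components are identified only up to a permutation of their labels.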

We have calculated the empirical coverage of the confidence intervals defined 
at the nominal level $1 - \alpha = 0.95$ for parameter estimates of the beta regression 
model, defined as the proportions of samples where the confidence intervals include 
the true parameter values in both scenarios (Tables 2 and 3). We do not observe 
a strong effect of the noise variance on the coverage, as the empirical proportions 
are all very close to the nominal confidence level even for large values of the noise 
variance. 
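
The empirical coverage described above is simply the fraction of replicated samples whose interval contains the true value. The sketch below assumes Wald-type intervals (estimate plus or minus a normal quantile times the standard error); the text does not state how the intervals were built, so this choice, the function name, and the toy numbers are assumptions.

```python
def empirical_coverage(estimates, std_errors, true_value, z=1.959964):
    """Proportion of Wald-type intervals est +/- z * se that contain true_value."""
    hits = 0
    for est, se in zip(estimates, std_errors):
        lower, upper = est - z * se, est + z * se
        hits += lower <= true_value <= upper
    return hits / len(estimates)

# Four hypothetical replications for a parameter whose true value is 0.3.
estimates = [0.28, 0.35, 0.10, 0.31]
std_errors = [0.05, 0.05, 0.05, 0.05]
coverage = empirical_coverage(estimates, std_errors, true_value=0.3)
```

With 200 replications per configuration, as in the study, a proportion computed this way is itself subject to sampling error of roughly one to two percentage points around a nominal level of 0.95.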




Table 1 Average Rand index and corresponding standard deviation by values of $\sigma^2$ 


Table 2 Empirical coverage of confidence intervals for model parameters: Scen. A 
0.94 










Table 3 Empirical coverage of confidence intervals for model parameters: Scen. B 

$\rho$  $b$  Component 1  Component 2  Component 3 
0.0  0.09  0.97  0.94  0.96 
0.0  0.11  0.91  0.96  0.95 
0.0  0.95  0.95 
0.2  0.09  0.95  0.94  0.93 
0.2  0.11  0.95  0.96  0.94 
0.2  0.97  0.96 
0.8  0.09  0.96  0.93  0.93 
0.8  0.11  0.96  0.93  0.93 
0.8  0.91  0.96 

5 Concluding Remarks 


We discuss cluster-weighted beta regression to model the location and the precision 
parameter for a response with a conditional beta distribution and a multivariate 
Gaussian set of observed covariates. The proposal embeds the finite mixture of beta 
regressions as a particular case, see [10], when $g(x_i \mid z_{ik} = 1) = g(x_i)$, that is 
when the distribution of the observed covariates does not change across components. 
In fact, in that case, the likelihood for model parameters can be factorized, and it 
does not depend on the observed covariates distribution. While this is quite clear 
from a theoretical point of view, there can be some shortcomings when dealing 
with the implementation of the proposed method. Namely, the adopted objective 
function depends on the “scale” of the observed quantities and, when no clustering 
on the covariates is present, a single multivariate Gaussian component is far more 
parsimonious than a finite mixture of Gaussian components, which is what would 
be implied by the finite mixture of beta regressions. Therefore the fact that the 
proposed method suggests $K = 1$ may well not be a signal that the beta regression 
model is homogeneous but, rather, that the Gaussian component is. 
Detecting Wine Adulterations Employing 
Robust Mixture of Factor Analyzers 


Andrea Cappozzo and Francesca Greselin 


Abstract An authentic food is one that is what it claims to be. Nowadays, more 
and more attention is devoted to the food market: stakeholders, throughout the 
value chain, need to receive exact information about the specific product they are 
trading. To ascertain varietal genuineness and distinguish potentially 
doctored food, in this paper we propose to employ a robust mixture estimation 
method. In particular, in a wine authenticity framework with unobserved heterogeneity, 
we jointly perform genuine wine classification and contamination detection. Our 
methodology models the data as arising from a mixture of Gaussian factors and 
depicts the observations with the lowest contributions to the overall likelihood as 
illegal samples. The advantage of using robust estimation on a real wine dataset 
is shown, in comparison with many other classification approaches. Moreover, the 
simulation results confirm the effectiveness of our approach in dealing with an 
adulterated dataset. 


Keywords Mixtures of factor analyzers · Food authenticity · Model-based 
clustering · Wine adulteration · Robust estimation · Impartial trimming 


1 Introduction and Motivation 


The wine segment is identified as a luxury market category, with savvy as well as 
non-expert customers willing to spend a premium price for a product of a specific 
vintage and cultivar. Therefore, in the context of global markets, analytical methods 
for wine identification are needed in order to protect wine quality and prevent its 
illegal adulteration. 
In the present work we employ an approach based on robust estimation of 
mixtures of Gaussian factor analyzers, for discriminating corrupted red wine samples 
from their authentic variety. In a modeling context, we assume a probability 
distribution function for the chemical and physical characteristics measured on the wines, 
considering a density in the form of a mixture, whenever the dataset presents more 
than one wine variety. As a consequence, the probability that a wine sample comes 
from one specific grape can be estimated from the model, performing classification 
through the Bayes rule. Robust estimation of the parameters in the model is adopted 
to recognize the corrupted data. In particular, we expect that adulterated observations 
would be implausible under the robustly estimated model: the illegal subsample 
is revealed by selecting observations with the lowest contributions to the overall 
likelihood using impartial trimming, without imposing any assumption on their 
underlying density. 

The rest of the paper is organized as follows: in Sect. 2 the notation is introduced 
and the main concepts about Gaussian Mixtures of Factor Analyzers (MFA), 
trimmed MFA likelihood, and the Alternating Expectation-Conditional Maximization 
(AECM) algorithm are summarized. Section 3 presents the wine dataset [7] 
and classification results obtained performing a robust estimation of Gaussian 
mixtures of factor analyzers. Section 4 reports a simulation study carried out 
employing parameters estimated from the model in Sect. 3, in the specific framework 
of a contaminated dataset. 

The original contribution of the present paper lies in the benchmark study on 
unsupervised methods, the adaptation of the robust Bayesian Information Criterion 
(BIC) introduced in [3] to MFA, and a first application of robust MFA in a somewhat 
realistic adulteration scenario. 

An application on real data and some simulation results confirm the effectiveness 
of our approach in dealing with an adulterated dataset when compared to analogous 
methods, such as partition around medoids and non-robust mixtures of Gaussian 
and mixtures of patterned Gaussian factors. 


2 Mixtures of Gaussian Factors Analyzers 


In this section we briefly recall the definition and some features of the mixture of 
Gaussian Factor Analyzers (MFA) and its parameter estimation procedure. MFA 
is a powerful tool for modeling unobserved heterogeneity in a population, as it 
concurrently performs clustering and local dimensionality reduction, within each 
cluster. Let $X_1, \dots, X_n$ be a random sample of size $n$ on a $p$-dimensional random 
vector. An MFA assumes that each observation $X_i$ is given by 

$$X_i = \mu_g + \Lambda_g U_{ig} + e_{ig} \tag{1}$$

with probability $\pi_g$, for $g = 1, \dots, G$. The total number of components in the 
mixture is denoted by $G$, $\mu_g$ are $p \times 1$ mean vectors, $\Lambda_g$ are the $p \times d$ matrices 
of factor loadings, $U_{ig} \overset{iid}{\sim} \mathcal{N}(0, I_d)$ are the factors, $e_{ig} \overset{iid}{\sim} \mathcal{N}(0, \Psi_g)$ are the 
errors, and $\Psi_g$ are $p \times p$ diagonal matrices. Note that $d < p$, that is the $p$ 
observable features are supposed to be jointly explained by a smaller number of 
$d$ unobservable factors. Further, $U_{ig}$ and $e_{ig}$ are independent, for $i = 1, \dots, n$ 
and $g = 1, \dots, G$. Unconditionally, therefore, $X_i$ has a density in the form of a 
$G$-component multivariate normal mixture: 


$$f_{X_i}(x_i; \theta) = \sum_{g=1}^{G} \pi_g\, \phi_p(x_i; \mu_g, \Sigma_g) \tag{2}$$

where $\phi_p(\cdot; \mu_g, \Sigma_g)$ denotes the $p$-variate normal density, whose covariance 
matrix $\Sigma_g$ has the following decomposition $\Sigma_g = \Lambda_g \Lambda_g' + \Psi_g$. 
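
The decomposition $\Sigma_g = \Lambda_g \Lambda_g' + \Psi_g$ and the generative form in (1) can be illustrated with a small pure-Python sketch for one component. The loading and noise values below are hypothetical, and the helper names are not part of any package used in the paper.

```python
import random

def mfa_covariance(loadings, noise_diag):
    """Sigma_g = Lambda_g Lambda_g' + Psi_g for a p x d loading matrix and diagonal noise."""
    p = len(loadings)
    sigma = [[sum(a * b for a, b in zip(loadings[r], loadings[c])) for c in range(p)]
             for r in range(p)]
    for j in range(p):
        sigma[j][j] += noise_diag[j]   # the noise only affects the diagonal
    return sigma

def draw_observation(mean, loadings, noise_diag, rng):
    """One draw X = mu_g + Lambda_g U + e, with U ~ N(0, I_d) and e ~ N(0, Psi_g)."""
    d = len(loadings[0])
    u = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return [m + sum(l * f for l, f in zip(row, u)) + rng.gauss(0.0, psi ** 0.5)
            for m, row, psi in zip(mean, loadings, noise_diag)]

loadings = [[1.0], [2.0], [0.5]]   # p = 3 features, d = 1 factor (hypothetical values)
noise_diag = [0.1, 0.2, 0.3]
sigma = mfa_covariance(loadings, noise_diag)
x = draw_observation([0.0, 0.0, 0.0], loadings, noise_diag, random.Random(1))
```

All off-diagonal covariances come from the shared factors, so a component needs $pd + p$ covariance parameters rather than the $p(p+1)/2$ of an unrestricted matrix, which is the local dimensionality reduction mentioned above.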

When estimating MFA through the usual Maximum Likelihood approach, two 
issues arise. Firstly, departure from normality in the data may cause biased or 
misleading inference. Some initial attempts in the literature to overcome this 
issue propose to consider mixtures of t-factor analyzers [15], but the breakdown 
properties of the estimators are not improved [10]. The second concern is related 
to the unboundedness of the log-likelihood function [4], which leads to 
estimation issues, like the appearance of non-interesting spurious maximizers and 
degenerate solutions. To cope with this second issue, common/isotropic noise 
matrices/patterned covariances [1] and a mildly constrained estimation [9] have 
been considered. The methodology considered here employs model estimation, 
complemented with trimming and constrained estimation, to provide robustness, 
to exclude singularities, and to reduce spurious solutions, along the lines of [8]. 
Therefore, with this approach, we overcome both previously mentioned issues. 

A mixture of Gaussian factor components is fitted to a given dataset
$x_1, x_2, \ldots, x_n$ in $\mathbb{R}^p$ by maximizing a trimmed mixture log-likelihood [18],

$$\mathcal{L}_{trim} = \sum_{i=1}^{n} \zeta(x_i) \log \left[ \sum_{g=1}^{G} \phi_p(x_i; \mu_g, \Lambda_g, \Psi_g) \, \pi_g \right] \qquad (3)$$

where $\zeta(\cdot)$ is a 0-1 trimming indicator function that tells us whether observation
$x_i$ is trimmed off or not: $\zeta(x_i) = 0$ if $x_i$ is trimmed off, and $\zeta(x_i) = 1$ otherwise. A
fixed fraction $\alpha$ of observations, the trimming level, is left unassigned by setting
$\sum_{i=1}^{n} \zeta(x_i) = [n(1 - \alpha)]$, so that the least plausible observations under the currently
estimated model are tentatively trimmed out at each step of the iterations that lead to
the final estimate. In the specific application to wine authenticity analysis described
in Sect. 3, the trimmed observations are assumed to originate from wine adulteration.
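
The trimming rule can be sketched as follows, assuming the mixture density of each observation under the current fit has already been evaluated. The helper name `trimmed_loglik` is hypothetical, and reading $[\cdot]$ as the integer part is an assumption of this sketch rather than something stated in the text.

```python
import math

def trimmed_loglik(mixture_densities, alpha):
    """Trimmed mixture log-likelihood from per-observation mixture densities.

    mixture_densities[i] stands for sum_g pi_g * phi_p(x_i; ...) under the current fit.
    Returns (value, zeta), where zeta[i] is the 0-1 trimming indicator.
    """
    n = len(mixture_densities)
    n_keep = math.floor(n * (1.0 - alpha))   # assumption: [.] read as the integer part
    # Rank observations from most to least plausible under the current model.
    order = sorted(range(n), key=lambda i: mixture_densities[i], reverse=True)
    zeta = [0] * n
    for i in order[:n_keep]:
        zeta[i] = 1
    value = sum(math.log(mixture_densities[i]) for i in range(n) if zeta[i] == 1)
    return value, zeta

# Toy densities: the third observation is far less plausible than the others.
dens = [0.30, 0.25, 1e-9, 0.28, 0.31, 0.27, 0.29, 0.26, 0.24, 0.32]
value, zeta = trimmed_loglik(dens, alpha=0.10)
```

Because the indicator is recomputed from the current fit, an observation trimmed at one iteration may re-enter at the next, which is why the text calls the trimming tentative.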

Then, a constrained maximization of (3) is adopted, by imposing $\psi_{g,ll} \le
c \, \psi_{h,mm}$ for $1 \le l \ne m \le p$ and $1 \le g \ne h \le G$, where $\{\psi_{g,ll}\}_{l=1,\ldots,p}$ are
the diagonal elements of the noise matrices $\Psi_g$ and $1 \le c < +\infty$, to avoid the
$|\Sigma_g| \to 0$ case. This constraint can be seen as an adaptation to MFA of those
introduced in [11]. Under the given constraints, Maximum Likelihood estimation of $\Psi_g$
is a well-defined maximization problem.
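
A minimal sketch of how such a constraint could be checked, under the assumption that it is applied to every pair of diagonal noise entries, so that it amounts to bounding the ratio between the largest and the smallest entry by $c$. The function names and toy values are hypothetical.

```python
def noise_ratio(psi_diags):
    """Largest ratio between any two diagonal noise entries across all components."""
    values = [v for diag in psi_diags for v in diag]
    return max(values) / min(values)

def satisfies_constraint(psi_diags, c):
    """True when psi_{g,ll} <= c * psi_{h,mm} holds for every pair of entries."""
    return noise_ratio(psi_diags) <= c

# Two components with p = 2 noise variances each (toy values).
psi_diags = [[0.10, 0.40], [0.05, 0.80]]
ok_with_c20 = satisfies_constraint(psi_diags, c=20)   # ratio is about 16
ok_with_c10 = satisfies_constraint(psi_diags, c=10)
```

Setting $c = 1$ would force all noise variances to be equal, while larger values of $c$ leave the estimates progressively freer.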

The Alternating Expectation-Conditional Maximization (AECM) algorithm, an extension of the
Expectation-Maximization algorithm, is considered, in view of the factor structure
of the model. The M-step is replaced by some computationally simpler conditional
maximization (CM) steps, along with different specifications of missing data. The
idea is to partition the vector of parameters $\theta = (\theta_1', \theta_2')'$ in such a way that $\mathcal{L}_{trim}$
is easy to maximize for $\theta_1$ given $\theta_2$ and vice versa. Therefore, two cycles are
performed at each algorithm iteration:

1st cycle: we set $\theta_1 = \{\pi_g, \mu_g, g = 1, \ldots, G\}$; here, the missing data are the
unobserved group labels $Z = (z_1', \ldots, z_n')$. After applying a Trimming step, which
assigns a null value of the “posterior probabilities” to the observations with the lowest
likelihood, we get one E-step and one CM-step for obtaining the parameters in $\theta_1$.

2nd cycle: we set $\theta_2 = \{\Lambda_g, \Psi_g, g = 1, \ldots, G\}$; here the missing data are
the group labels $Z$ and the unobserved latent factors $U_{11}, \ldots, U_{nG}$. We perform a
Trimming step, then an E-step, and a constrained CM-step, i.e., a conditional exact
constrained maximization of $\Lambda_g, \Psi_g$.

A detailed description of the algorithm is given in [8]. 


3 Wine Recognition Data 


The wine recognition dataset, first analysed in [7], reports the results of a chemical
and physical analysis of three different wine types, grown in the same region in
Italy. Originally, 28 attributes were recorded for 178 wine samples derived from
three different cultivars: Barolo, Grignolino, and Barbera. A reduced version of the
original dataset with only thirteen variables is publicly available in the University
of California, Irvine Machine Learning data repository, and is commonly used to test
the performance of newly introduced supervised and unsupervised classifiers.
In particular, in the unsupervised classification literature the wine recognition data
have been used to assess cluster analysis in information-theoretic terms via
minimisation of the partition entropy [19], to demonstrate the modelling capabilities of a
generalized Dirichlet mixture [2], and to evaluate the efficacy of employing distances
based on non-Euclidean norms [5] and of Random Forest dissimilarity [20]. More
recently, parsimonious Gaussian mixture models have also been applied to the
Italian wines dataset [16].

Here our purpose is twofold: we want to explore the classification performance
of a robust estimation based on mixtures of Gaussian Factor Analyzers, and we aim
at obtaining realistic parameters for the subsequent simulation study. The dataset,
available in the pgmm R package [17], contains 27 of the 28 original variables, since
the sulphur measurements were not available. Initially, to perform model selection
and detect the most suitable values of factors $d$ and groups $G$, an adaptation to the
MFA framework of the robust Bayesian Information Criterion, first introduced
in [3], has been considered. That is, $BIC = -2 \mathcal{L}_{trim}(x; \hat{\theta}) + v_c^{*} \log n^{*}$, where
$v_c^{*} = (G - 1) + Gp + G(pd - d(d - 1)/2) + (Gp - 1)(1 - 1/c) + 1$ denotes the
number of free parameters in the model (depending on the value of the constraint
$c$) and $n^{*} = [n(1 - \alpha)]$ the number of non-trimmed observations. Robust BIC values
for different choices of the number of factors $d$ and the number of groups $G$ are
reported in Table 1, considering a trimming level $\alpha = 0.05$ and $c = 20$. The value
of the robust BIC is minimized for $d = 4$ and $G = 2$, suggesting a mixture with just
two components. Careful investigation of this result highlighted that the robust MFA
methodology tended to cluster together Barolo and Grignolino samples as arising
from the same mixture component, while clearly separating Barbera observations. It
is worth recalling [7] that the wines in this study were collected over the period
1970-1979, and the Barbera wines are predominantly from a later period than the
Barolo or Grignolino wines. Therefore, considering the nature of the phenomenon
under study and the risks of rigidly selecting the number of components in a
mixture model only on the basis of an information criterion
such as BIC [13], we decided to employ a robust MFA with $d = 4$, $G = 3$, and $\alpha =
0.05$, leading to the classification matrix reported in Table 2. Employing a robust
MFA rather than a Gaussian mixture leads to a 60% reduction in the number of
parameters to be estimated (470 against 1217). Notice, in addition, that after robust
estimation the trimmed observations can also be classified a posteriori according to
the Bayes rule, i.e., assigning each of them to the component $g$ having the greatest value
of $D_g(x; \hat{\theta}) = \phi_p(x; \mu_g, \Lambda_g \Lambda_g' + \Psi_g)$.

Table 1 Robust BIC [3] for different choices of the number of factors d and the
number of groups G for the robust MFA model on wine data, trimming level α = 0.05

| d | G = 1   | G = 2   | G = 3   |
|---|---------|---------|---------|
| 1 | 9082.58 | 8282.92 | 8223.46 |
| 2 | 8560.62 | 8107.62 | 8112.90 |
| 3 | 8352.26 | 8042.02 | 8199.38 |
| 4 | 8160.77 | 7969.64 | 8315.23 |
| 5 | 8102.77 | 8044.03 | 8456.00 |
| 6 | 8097.06 | 8165.67 | 8735.63 |

Table 2 Classification table for the robust MFA with number of factors d = 4,
number of groups G = 3, trimming level α = 0.05 and c = 20 on the wine data

|            | 1  | 2  | 3  |
|------------|----|----|----|
| Barolo     | 59 | 0  | 0  |
| Grignolino | 0  | 71 | 0  |
| Barbera    | 0  | 0  | 48 |

Trimmed observations are classified a posteriori according to the Bayes rule
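
The parameter counts quoted above can be checked with a short sketch. It assumes the free-parameter formula reads $(G - 1) + Gp + G(pd - d(d - 1)/2) + (Gp - 1)(1 - 1/c) + 1$, which is a reconstruction of the printed expression, and that $[\cdot]$ denotes the integer part; the function names are hypothetical.

```python
import math

def gmm_free_params(G, p):
    """Unconstrained Gaussian mixture: weights + means + full covariance matrices."""
    return (G - 1) + G * p + G * p * (p + 1) // 2

def mfa_free_params(G, p, d, c=math.inf):
    """MFA count under the noise constraint c; c = inf gives the unconstrained count."""
    loadings = G * (p * d - d * (d - 1) // 2)
    noise = (G * p - 1) * (1.0 - 1.0 / c) + 1
    return (G - 1) + G * p + loadings + noise

def robust_bic(trimmed_loglik, n_params, n, alpha):
    """Robust BIC = -2 * L_trim + v * log(n*), with n* = [n(1 - alpha)]."""
    n_star = math.floor(n * (1.0 - alpha))   # assumption: integer part
    return -2.0 * trimmed_loglik + n_params * math.log(n_star)

n_gmm = gmm_free_params(G=3, p=27)            # full Gaussian mixture
n_mfa = mfa_free_params(G=3, p=27, d=4)       # MFA without the constraint
reduction = 1.0 - n_mfa / n_gmm               # roughly 0.61
```

Under this reading the two counts match the 470 and 1217 quoted in the text, and a finite $c$ lowers the MFA count slightly further.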

Results in Table 2 show that the robust MFA algorithm led to a perfect
clustering of the samples according to their true wine type.

For completeness, the robust MFA algorithm was also applied to the more
common thirteen-variable subset of the wine data, and a comparison with the existing
literature is reported in Table 3. The clustering performance with respect to the true
wine labels yields an Adjusted Rand Index equal to 0.98, with just one Grignolino
sample wrongly assigned to the cluster identifying Barolo wines. Again, the
robust MFA methodology outperforms the results currently reported in the literature
for unsupervised learning on this specific dataset.

Table 3 Comparison of performance metrics for different methodologies on the thirteen-variable
subset of the wine data

| Methodology                          | Class recovery accuracy | Adjusted Rand index |
|--------------------------------------|-------------------------|---------------------|
| Partition entropy [19]               | 0.977                   | -                   |
| Mixture of generalized Dirichlet [2] | 0.978                   | -                   |
| Neural gas [5]                       | 0.954                   | -                   |
| Random Forest predictors [20]        | -                       | 0.93                |
| Parsimonious Gaussian mixture [16]   | 0.927                   | 0.79                |
| Robust MFA [8]                       | 0.994                   | 0.98                |

Reported metrics come from the original articles
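
The two metrics reported for the robust MFA can be reproduced from a contingency table. The table below is a hypothetical reconstruction based only on the statement that a single Grignolino sample falls in the Barolo cluster; the helper names are likewise assumptions of this sketch.

```python
from math import comb

def adjusted_rand_index(table):
    """Adjusted Rand Index from a contingency table (rows: true classes, columns: clusters)."""
    n = sum(sum(row) for row in table)
    sum_cells = sum(comb(v, 2) for row in table for v in row)
    sum_rows = sum(comb(sum(row), 2) for row in table)
    sum_cols = sum(comb(sum(col), 2) for col in zip(*table))
    expected = sum_rows * sum_cols / comb(n, 2)   # agreement expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)
    return (sum_cells - expected) / (max_index - expected)

def accuracy(table):
    """Share of units on the main diagonal (clusters already matched to classes)."""
    n = sum(sum(row) for row in table)
    return sum(table[i][i] for i in range(len(table))) / n

# Hypothetical table: one Grignolino sample placed in the Barolo cluster.
table = [[59, 0, 0],
         [1, 70, 0],
         [0, 0, 48]]
ari = adjusted_rand_index(table)
acc = accuracy(table)
```

For this table the index is about 0.98 and the accuracy is 177/178, about 0.994, in line with the last row of Table 3.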


4 Simulation Study 


The purpose of this simulation study is to show the effectiveness of estimating a
robust MFA on a set of observations drawn from two luxury wines, Barolo and
Grignolino, and of identifying units that have been adulterated. Using the parameter
estimates obtained in Sect. 3, the artificial dataset is generated by simulating
100 observations from each of the Barolo and Grignolino components. Afterwards, the
“contamination” is created by decreasing by 15% the values of Fixed Acidity, Tartaric
Acid, Malic Acid, Uronic Acids, Potassium, and Magnesium for 5 Barolo and
5 Grignolino observations. This procedure resembles the illegal practice of adding
water to wine [12]. The problem of distinguishing adulterated observations from the
real mixture components is addressed, together with the algorithm performance in
correctly classifying the authentic units.
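
The contamination step can be sketched as below. The variable names follow the list given in the text, while the function name, the dictionary layout, and the numeric values of the toy observation are assumptions made for illustration only.

```python
def adulterate(sample, variables, shrink=0.15):
    """Return a copy of one observation with the listed variables reduced by `shrink`."""
    diluted = dict(sample)
    for name in variables:
        diluted[name] = sample[name] * (1.0 - shrink)
    return diluted

DILUTED_VARIABLES = ["Fixed Acidity", "Tartaric Acid", "Malic Acid",
                     "Uronic Acids", "Potassium", "Magnesium"]

# Toy observation (hypothetical values; only some of the 27 variables are shown).
wine = {"Alcohol": 13.0, "Fixed Acidity": 80.0, "Tartaric Acid": 2.0,
        "Malic Acid": 1.5, "Uronic Acids": 1.0, "Potassium": 900.0,
        "Magnesium": 100.0}
watered = adulterate(wine, DILUTED_VARIABLES)
```

Only the six listed variables change, so an adulterated unit remains close to its component on the other features, which is what makes its detection non-trivial.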

We estimate a robust MFA with $G = 2$, $p = 27$, $d = 4$ and trimming level $\alpha =
0.05$. We compare our results with other popular methods: Partition around medoids,
Gaussian mixtures estimated via Mclust, and Mixtures of patterned Gaussian factors
estimated by pgmm. In each of the B = 1000 simulations, the algorithms have
been initialized following the indications of their respective authors: namely, 10 random
starts at each run of AECM, the default setting for the “build phase” of pam as in
[14], model-based hierarchical clustering as per the default setting in [6] for
Mclust, and 10 random starts at each run as suggested in [16] for pgmm.




Table 4 Average misclassification errors and ARI (percent average values on 1000 runs)

|                         | AECM   | pam    | Mclust | pgmm   |
|-------------------------|--------|--------|--------|--------|
| Misclassification error | 0.0309 | 0.2935 | 0.2073 | 0.2314 |
| Adjusted Rand Index     | 0.9362 | 0.5466 | 0.7184 | 0.6959 |

Fig. 1 Boxplots of the simulated distributions of $\hat{\mu}_1[1]$, estimator for $\mu_1[1] = 10.45$ (left panel);
$\hat{\Sigma}_1[1, 1]$, estimator for $\Sigma_1[1, 1] = 0.1214$ (right panel)


Table 4 reports the average misclassification error and Adjusted Rand Index: the 
AECM algorithm reports a superb classification rate, with smaller variability of the 
simulated distributions for the estimated quantities, as shown in Fig. 1. 

For a fair comparison of the performance of the algorithms, we consider 3
clusters for pam, Mclust, and pgmm, whereas we consider only 2 clusters for AECM,
because in this approach the adulterated group should ideally be captured by the
trimmed units. A value of $c = 20$ makes it possible to discard singularities and to reduce
spurious solutions [8]. The effects of the trimming procedure are shown in Fig. 2,
where the different colours and shapes represent the obtained classification. Table 5
reports the average bias and MSE for the mixture parameters (computed element-wise
for every component). While an R package is under construction, R scripts
containing the employed routines are available from the authors upon request.

The present simulations show promising initial results for the adoption of robust MFA as
a tool for identifying wine adulteration. Future research will address a novel approach to
semi-supervised robust clustering, allowing for impartial trimming on both labelled
and unlabelled data partitions. The aim is to jointly address methodological issues
in robust statistics and clustering, and to provide the consistent statistical tools
required in the increasingly demanding food authenticity domain.

Fig. 2 Clustering of the simulated data with fitted trimmed and constrained MFA. Trimmed
observations are denoted by “×”


Table 5 Bias and MSE (in parentheses) of the parameter estimators $\hat{\mu}_g$ and $\hat{\Sigma}_g$

|               | AECM             | Mclust           | pam              |                  | AECM             | Mclust           | pgmm               |
|---------------|------------------|------------------|------------------|------------------|------------------|------------------|--------------------|
| $\hat{\mu}_1$ | 0.0019 (0.0029)  | -0.0194 (0.0421) | 0.0069 (0.1022)  | $\hat{\Sigma}_1$ | 0.0001 (0.0004)  | -0.001 (0.0022)  | 0.0257 (illegible) |
| $\hat{\mu}_2$ | -0.0011 (0.0042) | 0.1522 (0.2376)  | -0.0025 (0.1380) | $\hat{\Sigma}_2$ | -0.0156 (0.0043) | -0.0164 (0.0043) | 0.0113 (illegible) |


Simultaneous Supervised and
Unsupervised Classification Modeling
for Assessing Cluster Analysis
and Improving Results Interpretability


Mario Fordellone and Maurizio Vichi 


Abstract In the unsupervised classification field, the unknown number of clusters 
and the lack of assessment and interpretability of the final partition by means of 
inferential tools denote important limitations that could negatively influence the 
reliability of the final results. In this work, we propose to combine unsupervised 
classification with supervised methods in order to enhance the assessment and 
interpretation of the obtained partition. In particular, the approach consists in
combining the k-means (KM) clustering method with logistic regression (LR)
modeling to obtain an algorithm that allows an evaluation of the partition identified
through KM, to assess the correct number of clusters, and to verify the selection
of the most important variables. An application on real data is presented to better
clarify the utility of the proposed approach.


Keywords Supervised classification · Unsupervised classification · Assessing
clustering


1 Introduction 


In unsupervised classification techniques, clusters of homogeneous objects are
detected by means of a set of features measured (observed) on a set of objects
without knowing the membership of objects to clusters. In these applications, the
aim is to discover the heterogeneous structure of the data. In unsupervised
classification, the principal approaches to cluster analysis [6] are:
connectivity-based clustering, better known as hierarchical clustering, centroid-based clustering,
distribution-based clustering, density-based clustering, and many other parametric
and non-parametric techniques [7].

Conversely, supervised classification is based on the idea of forecasting the
membership of new objects (output) based on a set of features (inputs) measured on
a training set of objects for which the membership to clusters is known. Therefore,
in these applications, the aim is to generalize a function or mapping from inputs to
outputs which can then be used to generate an output for previously
unseen inputs [4, 8]. Usually, a training subsample that is representative of
specific groups is selected, a model is fitted to it, and this model is then used as a
reference for the classification of new (unobserved) objects. Training sets are selected based
on the knowledge of the user. Supervised classification models include artificial
neural networks, naive Bayes classifiers, nearest neighbor algorithms, decision
trees, logistic regression, generalized linear models, and many other parametric and
non-parametric techniques.

In unsupervised classification, there are important issues that could drastically
influence results: (1) the unknown number of clusters, (2) the absence of a selection
of the variables that contribute most to clustering, and (3) the lack of a final assessment
of clusters [3]. In other words, all the decisions taken during the study can lead to different
results, and each single decision becomes crucial for the aim of the study and needs
to be tested.

In this work, we propose an algorithm based on the use of supervised
classification modeling. In particular, our approach consists in the simultaneous application
of k-means (KM) [10] and logistic regression (LR) [1] modeling. We will show that,
by using LR, we have effective inferential tools for choosing the number of clusters,
selecting the most important variables for the clustering, and assessing the quality
of clusters.

The paper is structured as follows: in Sect. 2 we present our proposal for the
simultaneous application of unsupervised and supervised classification modeling,
in Sect. 3 we show an application on real data, and finally, in Sect. 4 we give
some suggestions and concluding remarks on the work.


2 Proposal 


In unsupervised classification modeling, we are not interested in prediction because 
we do not have an associated response variable y like in supervised classification 
modeling. Therefore, this paper proposes to simultaneously apply unsupervised (i.e., 
KM) and supervised classification (i.e., LR) approaches, where the latter aims to 
evaluate and to improve the former with the additional data structure information. 
We will call this approach k-means-logistic regression (KM-LR).

Given the $n \times J$ data matrix $X$, for $k = 2, \ldots, K_{max}$, where $K_{max}$ is
the maximum number of clusters the researcher thinks the data might have, KM-LR
is composed of the following principal steps:

Algorithm 1 KM-LR algorithm

 1: for $k = 2$ to $K_{max}$ do
 2:   K-means step
 3:   Randomly initialize the membership matrix $U$;
 4:   Compute the centroids matrix by $\bar{X} = (U^T U)^{-1} U^T X$;
 5:   Minimize the objective function $\|X - U\bar{X}\|^2$ with respect to the membership matrix $U$;
 6:   Update the centroids matrix $\bar{X}_n = (U_n^T U_n)^{-1} U_n^T X$ given the new assignment matrix $U_n$;
 7:   if $\|X - U_n \bar{X}_n\|^2 > \omega$ then
        $\bar{X} = \bar{X}_n$, $U = U_n$, repeat steps 5-6;
      else
        exit loop; obtain the categorical cluster vector $g_k$;
 8:   end if
 9:   Multinomial logistic regression step
10:   LR is estimated on $g_k$, with explanatory variables $X$, to estimate the probabilities $\pi_k(x)$ of its
      $k - 1$ response categories and the probability $\pi_0(x)$ of its baseline category;
11:   if some estimated LR coefficient is not statistically significant at the 5% level then
        remove the corresponding variables from the matrix $X$;
        repeat steps 2-10;
12:   end if
13: end for


At the end, we obtain $K_{max} - 1$ identified partitions (with a different number of
clusters $k$), together with a reduced set of statistically significant variables and a set
of inferential tools to assess the quality of the partition. The best partition (with the
optimal number of clusters $k$) is identified as the one giving the largest increase
of a $\chi^2$-test computed on the partitions obtained by KM and LR. In this way, through
the analysis of the LR results (e.g., explained variance, parameter significance,
residual variance), we have an evaluation of the partition obtained by KM. In fact,
a good performance of the LR model on the response variable derived from the KM
outcome means that the variables included in the model provide a good explanation
for the group structure in the data. Moreover, through the analysis of the LR coefficients,
we can see which variables contribute the most to identifying the group structure
and to what extent they do so (by analyzing statistical significance, value estimates,
and signs of coefficients).

Note that the algorithm monotonically decreases the loss function or at least does
not increase it. However, it is not guaranteed to stop at the global minimum of the
loss function. For this reason, it is recommended to use a large number of randomly
started runs to find the best solution. The predictive accuracy of the methodology can
be assessed by cross-validation to give an insight into how the model will generalize
to an independent data set. In a subsequent paper we will include a cross-validation
procedure and a simulation study to assess the predictive accuracy and evaluate the
performance of the algorithm.
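
The K-means part of the procedure, together with the advice on random starts, can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the LR step and the variable-selection loop are omitted, the stopping rule is "memberships unchanged" rather than a loss threshold, and the function names are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans_once(X, k, rng, max_iter=100):
    """One randomly started k-means run; returns (labels, loss)."""
    labels = [rng.randrange(k) for _ in X]             # random membership (matrix U)
    centroids = [list(x) for x in rng.sample(X, k)]    # used only if a cluster is empty
    for _ in range(max_iter):
        # Centroid update: mean of the units currently assigned to each cluster.
        for g in range(k):
            members = [x for x, lab in zip(X, labels) if lab == g]
            if members:
                centroids[g] = [sum(col) / len(members) for col in zip(*members)]
        # Assignment update: each unit moves to its closest centroid.
        new_labels = [min(range(k), key=lambda g: sq_dist(x, centroids[g])) for x in X]
        if new_labels == labels:
            break
        labels = new_labels
    loss = sum(sq_dist(x, centroids[lab]) for x, lab in zip(X, labels))
    return labels, loss

def kmeans_best_of(X, k, n_starts=20, seed=0):
    """Keep the run with the smallest loss over several random starts."""
    rng = random.Random(seed)
    return min((kmeans_once(X, k, rng) for _ in range(n_starts)), key=lambda r: r[1])

# Two well-separated toy groups of three points each.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels, loss = kmeans_best_of(X, k=2)
```

Each run can only stop at a local minimum of the loss, so keeping the best of several starts is a cheap safeguard; the resulting label vector is what the LR step would then take as its response variable.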

In the next section, an application on real data is presented.


3 Application on Real Data


In this section a real data application of KM-LR is presented. The data set is the
Wine Data [5]. It is the result of a chemical analysis of wines grown in an Italian
region, derived from three different cultivars.

The 13 constituents were measured on 178 wines from the three cultivars:
59, 71, and 48 instances are in class one, two, and three, respectively. The 13
continuous chemical attributes of the wine data set are: 1. Alcohol (Alc), 2. Malic
acid (Mal), 3. Ash (Ash), 4. Alkalinity of ash (AAsh), 5. Magnesium (Mg), 6.
Total phenols (Phe), 7. Flavonoids (Fla), 8. Non-Flavonoids phenols (NPhe), 9.
Proanthocyanidins (ProA), 10. Color intensity (Col), 11. Hue (Hue), 12.
OD280-OD315 of diluted wines (ROD), and 13. Proline (Pro).

In the analysis, we tried to select the optimal number of clusters without
considering the a priori information that $K = 3$, using the KM-LR algorithm,
i.e., through the maximization of the increase of the $\chi^2$-test computed on the
partitions obtained by KM and LR. For comparison purposes, two other approaches
have been used. The procedure has been repeated 50 times, from 2 to 10
clusters, each time using a single random start. Table 1 reports the results obtained
by KM-LR (first column), the sequential application of KM followed by the Gap-method proposed
by Tibshirani [12] (second column), and the sequential application of KM followed
by Calinski and Harabasz’s [2] criterion (third column).
The best performance has been obtained by the KM-LR approach, where the optimal
number of clusters has been captured in 36 out of 50 runs (72%). In contrast,
the KM → Gap-method obtained the worst performance, since the optimal number of
clusters was captured only 5 times (10%). Thus, the KM-LR approach seems to
reduce the effect of the local minima problem of the KM algorithm, which is even
more relevant when the KM partition is not modified, as in the
KM → Gap-method and the KM → Calinski-Harabasz method.


Table 1 Optimal K selection from 2 to 10 clusters over the 50 random repetitions, each using a
single random start

|       | KM-LR |         | KM → Gap-method |         | KM → Calinski-Harabasz |         |
| K     | Count | Percent | Count           | Percent | Count                  | Percent |
|-------|-------|---------|-----------------|---------|------------------------|---------|
| 2     | 0     | 0.00    | 0               | 0.00    | 0                      | 0.00    |
| 3     | 36    | 72.00   | 5               | 10.00   | 22                     | 44.00   |
| 4     | 10    | 20.00   | 0               | 0.00    | 5                      | 10.00   |
| 5     | 2     | 4.00    | 0               | 0.00    | 3                      | 6.00    |
| 6     | 2     | 4.00    | 0               | 0.00    | 3                      | 6.00    |
| 7     | 0     | 0.00    | 2               | 4.00    | 0                      | 0.00    |
| 8     | 0     | 0.00    | 1               | 2.00    | 0                      | 0.00    |
| 9     | 0     | 0.00    | 15              | 30.00   | 6                      | 12.00   |
| 10    | 0     | 0.00    | 27              | 54.00   | 11                     | 22.00   |
| Total | 50    | 100.00  | 50              | 100.00  | 50                     | 100.00  |


Table 2 Estimation results [...] partition including only predictors with a 5%
significant coefficient

|        | Estimate | SE     | t-stat  | p-value   |
|--------|----------|--------|---------|-----------|
| Const. | 2.0169   | 0.0296 | 68.2200 | 2.66E-122 |
| Alc    | -0.2306  | 0.0465 | -4.9579 | 1.76E-06  |
| Mal    | -0.0865  | 0.0382 | -2.2674 | 2.47E-02  |
| Mg     | -0.1264  | 0.0353 | -3.5808 | 4.51E-04  |
| Fla    | -0.2012  | 0.0786 | -2.5597 | 1.14E-02  |
| Col    | -0.0806  | 0.0516 | -1.5634 | 1.200E-02 |
| Hue    | 0.0970   | 0.0474 | 2.0492  | 4.20E-02  |
| Pro    | -0.3627  | 0.0498 | -7.2806 | 1.31E-11  |

178 observations, 164 error degrees of freedom; Dispersion: 0.138, AICc = 160.34,
BIC = 185.95; R-squared-adj. = 0.8135; F-statistic: 93.70, p-value = 5.19E-55


In Table 2 we show the estimation results of LR applied to the group labels
identified through the KM model as a response variable, including only variables
with significant coefficients as predictors.

From Table 2 we can note that the model shows good performance and about 80%
of the total deviance is explained (i.e., $R^2 \approx 0.81$). The variables Ash, Alkalinity of
Ash, Total phenols, Non-Flavonoids phenols, Proanthocyanidins, and the
OD280-OD315 of diluted wines have been excluded because they were not statistically
significant at the 5% level. In Fig. 1 the partitions identified by KM-LR (highlighted
with different symbols) are represented on the 7 included variables.

The partition seems well represented on most pairs of variables, because it is
represented by the statistically most significant variables. Moreover, the partition
found by the KM-LR approach better recovers the real data partition defined by
the three different cultivars.

Table 3 shows (1) the confusion matrix between the real data partition and the
KM partition (i.e., KM applied to the complete data) and (2) the confusion matrix
between the real data partition and the KM-LR partition.

The misclassification rate and the adjusted Rand index (ARI) [11] computed on
the left table (i.e., the real partition versus the KM partition) are equal to 0.3708
and 0.2977, respectively; the same indices computed on the right table (i.e., the real
partition versus the KM-LR partition) are equal to 0.1818 and 0.5465, respectively.
We recall that the ARI equals 1 when the two clusterings are identical and is close
to 0, its expected value, when their agreement is no better than chance; it can also
take small negative values.

Moreover, by applying LR to the real data partition we obtain the following
confusion matrix between the real partition and the one fitted by LR (Table 4).

Fig. 1 The three clusters identified by KM-LR represented on the variables included in the model


Table 3 Confusion matrix between: (1) real data partition and KM partition; (2) real data partition
and KM-LR partition

[Table body not legible in the source]


Table 4 Confusion matrix between real data partition and LR partition

| Real  | C1  | C2  | C3  | Total |
|-------|-----|-----|-----|-------|
| C1    | 15  | 44  | 0   | 59    |
| C2    | 6   | ... | 3   | 71    |
| C3    | ... | ... | ... | ...   |
| Total | ... | ... | ... | 178   |

The performance of KM-LR is also better. In fact, the misclassification rate and
ARI computed on Table 4 are equal to 0.5225 and 0.0247, respectively. In Table 5,
the performances obtained by both LR applied to the real partition and KM-LR
are shown. We note that the diagnostic indices obtained by KM-LR are much better than
those obtained by applying LR to the real data partition. Furthermore, in the
application of LR to the real data partition, only the variable Color intensity
obtained a statistically significant coefficient, and therefore only this variable has been
included in the model.

Finally, to obtain a quality measure of the clusters, a MANOVA model [9] has been
applied to the real data partition and to those obtained by the KM and KM-LR models
(Table 6).
The null hypothesis is rejected in each of the three cases, i.e., the means of
the groups are not the same J-dimensional multivariate vector, and the differences
observed in the sample are not due to random chance. However, we can note that
the most significant value of $\Lambda$ is obtained for the KM-LR partition. In Fig. 2, the
distributions of the three KM-LR clusters on the reduced set of variables are
shown.


Table 5 Comparison between LR and KM-LR

|                | LR       | KM-LR    |
|----------------|----------|----------|
| F-Statistic    | 14.5000  | 93.7000  |
| p-value        | 0.0002   | 5.19E-55 |
| R-squared-adj. | 0.0710   | 0.8135   |
| AICc           | 403.3673 | 160.3400 |
| BIC            | 409.6623 | 185.9500 |


Table 6 MANOVA results obtained on the real data partition and on that obtained by k-means
and k-means-logistic regression

|        | Wilks' Lambda | Chi-squared approximation | Degrees of freedom (chi-squared) | p-value  | Partition |
|--------|---------------|---------------------------|----------------------------------|----------|-----------|
| Const. | 0.2052        | 267.6509                  | 26                               | 0.00E+00 | Real      |
| Group  | 0.7904        | 39.7581                   | 12                               | 7.89E-05 | Real      |
| Const. | 0.2043        | 268.3793                  | 26                               | 0.00E+00 | KM        |
| Group  | 0.7609        | 44.3934                   | 12                               | 1.31E-05 | KM        |
| Const. | 0.2303        | 248.1821                  | 26                               | 0.00E+00 | KM-LR     |
| Group  | 0.8063        | 36.3558                   | 12                               | 2.80E-06 | KM-LR     |

Fig. 2 Boxplots of the three KM-LR cluster distributions represented on the variables included in
the model


4 Concluding Remarks 


In the unsupervised classification approaches, the unknown number of clusters and 
the lack of assessment of the final partition are crucial issues that could negatively 
affect the reliability of the results. In this work we proposed an algorithm that 
combines KM and the LR modeling to evaluate the partition identified through KM, 
to assess the correct number of clusters, and to verify the selection of the most 
important variables. We did this by using well-known inferential tools that allowed 
us to statistically confirm the obtained results. 

The application on real data shows that this methodology obtains better perfor- 
mance than the usual KM approach, reducing the effect of local minima. Moreover, 
KM-LR represents a useful tool to identify the variables that better contribute 
to defining the group structure in the data and removing the statistically non- 
significant variables from the model. In this way, we have a parsimonious set of 
variables that define the best partition of the data. Thus, the methodology seems 
promising. However, in future work, we wish to assess the performance of the
proposed methodology more thoroughly through an extensive simulation study.


A Parametric Version of Probabilistic
Distance Clustering


Christopher Rainey, Cristina Tortora, and Francesco Palumbo 


Abstract The probabilistic distance (PD) clustering method is grounded on the basic
assumption that the product between the probability of the unit belonging to a
cluster and the distance between the unit and the cluster center is constant, for
each statistical unit. This constant is a measure of the classificability of the point,
and the sum of the constant over units is referred to as the joint distance function
(JDF). The parameters that minimize the JDF maximize the classificability of the
units. The goal of this paper is to introduce a new distance measure based on
a probability density function; specifically, we use the multivariate Gaussian and
Student-t distributions. We show using two simulated data sets that the use of a
distance based on these two density functions improves the performance of PD
clustering.


Keywords PD clustering · Clustering algorithm · Gaussian distribution ·
Multivariate Student-t distribution


1 Introduction 


Data clustering and classification are among the most investigated domains in statis- 
tics and machine learning. In general, clustering and classification methods can be 
divided into hierarchical and non-hierarchical methods. Non-hierarchical methods, 
the focus of this paper, produce a partition of the individuals into a specified number 
of groups by optimizing a numerical criterion [8]. Specifically, statistical
non-hierarchical approaches are generally divided into two main categories: (1) heuristic
(non-parametric) and (2) model-based approaches. The heuristic approach does not
make any assumption about the class structure, and the criterion to optimize is
generally based on a distance or dissimilarity measure. Class membership can be
defined by a crisp function (clusters are mutually exclusive) or by a fuzzy function,
where a membership function is defined in [0, 1] for each unit and group, summing
to 1 over groups, for each unit. Under this paradigm, the most used methods are
k-means [12] and fuzzy c-means clustering [3]. The model-based approach postulates
a formal statistical model for the classes (for example, that data were sampled from
the Gaussian density) and it assumes that groups differ only by the density function
parameter(s). Under this approach, the optimization problem consists in finding
the parameters that maximize the likelihood function. Because the membership is
unknown, a common approach is to maximize the complete data likelihood; more
details can be found in [13]. As the membership function is derived as a probability,
it naturally varies in [0, 1] [6]. When the membership function is defined in [0, 1]
the clustering approach is also called probabilistic.

This paper has a twofold aim: it builds a bridge between model-based and 
distance clustering by reformulating the probabilistic distance (PD) clustering 
algorithm, first introduced in 2008 by Ben-Israel and Iyigun [2], using a density 
function; then, it proposes a probabilistic distance clustering algorithm based on 
the multivariate Gaussian and Student-t densities. Simulated data sets, generated 
under two different scenarios, are used to show the performance of the algorithm. 

The chapter is arranged into five sections including the present introduction. 
Section 2 briefly presents the PD clustering algorithm [2]; Sect. 3 reformulates 
the PD algorithm under a parametric paradigm; Sect. 4 presents an example on 
simulated data sets; and the last section provides some concluding remarks. 


2 Probabilistic Distance Clustering 


Probabilistic distance (PD) clustering was proposed in [2] in a distance-based and 
distribution-free context. It is a non-recursive, iterative partitioning algorithm, and it 
assumes that the number of clusters is known a priori. Given some random centers, 
the PD clustering algorithm assumes that the probability of a point belonging to a 
cluster is inversely proportional to the distance from the center of that cluster [10]. 
Suppose we have a data matrix X with n units and J variables, and consider K 
(non-empty) clusters. Let $x_i$ denote a generic J-dimensional data vector and 
$c_k$ a generic J-dimensional vector of centers, with k = 1, ..., K and i = 1, ..., n. 
PD clustering is based on two quantities: the distance of each $x_i$ from each cluster 
center $c_k$, denoted $d(x_i, c_k)$, and the probability of each point belonging to a cluster, 
i.e., $p(x_i, c_k)$, for k = 1, ..., K and i = 1, ..., n. The relationship between them 
is the basic principle underlying the method. 

For short, we define $p_{ik} := p(x_i, c_k)$ and $d_{ik} := d(x_i, c_k)$. PD clustering is 
based on the principle that the product of the distances and the probabilities is a 
constant depending only on $x_i$ [2]. Denoting this constant by $F(x_i)$, the following 
equality holds: 

$$p_{ik}\, d_{ik} = F(x_i), \qquad (1)$$

where $F(x_i)$ depends only on $x_i$, i.e., $F(x_i)$ does not depend on the cluster k = 
1, ..., K. As the distance from the cluster center decreases, the probability of the 
point belonging to the cluster increases. Starting from (1), it is possible to compute 
$p_{ik}$ as 

$$p_{ik} = \frac{\prod_{m \neq k} d_{im}}{\sum_{m=1}^{K} \prod_{l \neq m} d_{il}}, \quad k = 1, \ldots, K. \qquad (2)$$

Then, from (1) and (2), it is possible to define the value of the constant $F(x_i) = 
p_{ik} d_{ik}$ as 

$$F(x_i) = \frac{\prod_{m=1}^{K} d_{im}}{\sum_{m=1}^{K} \prod_{l \neq m} d_{il}}, \quad k = 1, \ldots, K. \qquad (3)$$


The quantity $F(x_i)$ is a measure of the closeness of $x_i$ to the cluster centers, and 
it determines the classifiability of the point $x_i$ with respect to the centers $c_k$, for 
k = 1, ..., K. The smaller the value of $F(x_i)$, the higher the probability of the point 
belonging to one cluster. If the distances between the point $x_i$ and the centers of 
the clusters are all equal to $d_i$, then $F(x_i) = d_i/K$ and all of the probabilities of 
belonging to each cluster are equal, i.e., $p_{ik} = 1/K$. The whole clustering problem 
consists of the identification of the centers that minimize the sum over i of $F(x_i)$. In 
[2], the authors suggest using $p^2$ instead of $p$ in (1) because it is a smoothed version 
of the problem, making the optimization function convex. The resulting quantity is 
called the joint distance function (JDF): 

$$\mathrm{JDF} = \sum_{i=1}^{n} \sum_{k=1}^{K} d_{ik}\, p_{ik}^{2}. \qquad (4)$$


Extensive details on PD clustering are given in [2]. In 2016, Tortora et al. proposed 
a factor version of the method to deal with high-dimensional data [18]. 
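
As an illustration of Eqs. (1)-(4), the following minimal Python sketch computes the membership probabilities and the constant F for a single point from its distances to the K centers, and then the JDF for a set of points. The function names `pd_memberships` and `jdf`, and the rule used when a point coincides with a center (zero distance), are assumptions made for this sketch and are not taken from [2].

```python
from math import prod

def pd_memberships(distances):
    """Return (p, F) for one point given its distances to the K centers.

    p[k] follows Eq. (2) and F follows Eq. (3). If the point coincides with
    one or more centers (zero distance), membership is split evenly among
    those centers; this tie rule is an assumption of the sketch.
    """
    K = len(distances)
    zeros = [k for k, d in enumerate(distances) if d == 0]
    if zeros:
        return [1.0 / len(zeros) if k in zeros else 0.0 for k in range(K)], 0.0
    # Numerator of Eq. (2): product of the distances to all other centers.
    leave_one_out = [prod(d for m, d in enumerate(distances) if m != k)
                     for k in range(K)]
    denom = sum(leave_one_out)
    p = [v / denom for v in leave_one_out]
    F = prod(distances) / denom
    return p, F

def jdf(all_distances):
    """Joint distance function of Eq. (4): sum_i sum_k d_ik * p_ik**2."""
    total = 0.0
    for distances in all_distances:
        p, _ = pd_memberships(distances)
        total += sum(d * pk ** 2 for d, pk in zip(distances, p))
    return total

p, F = pd_memberships([1.0, 3.0])
print(p, F)  # [0.75, 0.25] 0.75
```

In this example the product p_ik d_ik equals 0.75 for both clusters, as required by Eq. (1), and the closer center receives the larger probability.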


3 Methodology 


A parametric version of the PD algorithm can be obtained using a dissimilarity 
measure based on a probability density function. Specifically, let 
$M_k = \max_x f(x; \mu_k, \theta_k)$; then the quantity $\log(M_k f(x_i; \mu_k, \theta_k)^{-1})$ is a dissimilarity 
measure, where $f(x_i; \mu_k, \theta_k)$ is a symmetric unimodal density function with 
location parameter $\mu_k$ and parameter vector $\theta_k$. For a generic density 
function $f(x_i; \mu_k, \theta_k)$, one of the parameters is the mean or, alternatively, the 
non-central parameter. 

A general measure $d(x, y)$ is a dissimilarity measure if the following conditions 
are verified [16, p. 404]: 


1. $d(x, y) \geq 0$. 
2. $d(x, y) = 0 \Leftrightarrow x = y$. 
3. $d(x, y) = d(y, x)$. 

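Here we prove the following proposition. 


Proposition Let $f(x_i; \mu_k, \theta_k)$ be a generic symmetric unimodal multivariate 
density function of the random variable X with parameter vector $\theta_k$ and location 
parameter $\mu_k$; then 

$$d(x_i, \mu_k) = \log\left(\frac{M_k}{f(x_i; \mu_k, \theta_k)}\right) \qquad (5)$$

satisfies all three properties and is a dissimilarity measure for k = 1, ..., K. 


1. $d(x_i, \mu_k) \geq 0$, $\forall x_i$. 

Proof 

$$0 < \frac{f(x_i; \mu_k, \theta_k)}{M_k} \leq 1 \;\Rightarrow\; \frac{M_k}{f(x_i; \mu_k, \theta_k)} \geq 1 \;\Rightarrow\; \log\left(\frac{M_k}{f(x_i; \mu_k, \theta_k)}\right) \geq 0. \qquad \square$$


2. $d(x_i, \mu_k) = 0 \Leftrightarrow x_i = \mu_k$. 

2a. $x_i = \mu_k \Rightarrow d(x_i, \mu_k) = 0$, $\forall x_i$. 

Proof 

$$x_i = \mu_k \;\Rightarrow\; f(x_i; \mu_k, \theta_k) = f(\mu_k; \mu_k, \theta_k) = M_k \;\Rightarrow\; \log\left(\frac{M_k}{f(x_i; \mu_k, \theta_k)}\right) = \log(1) = 0. \qquad \square$$

2b. $d(x_i, \mu_k) = 0 \Rightarrow x_i = \mu_k$, $\forall x_i$. 

Proof 

$$\log\left(\frac{M_k}{f(x_i; \mu_k, \theta_k)}\right) = 0 \;\Rightarrow\; \frac{M_k}{f(x_i; \mu_k, \theta_k)} = 1 \;\Rightarrow\; f(x_i; \mu_k, \theta_k) = M_k = f(\mu_k; \mu_k, \theta_k) \;\Rightarrow\; x_i = \mu_k. \qquad \square$$


3. $d(x_i, \mu_k) = d(\mu_k, x_i)$, $\forall x_i$. 

Proof Given $\theta_k$, 

$$f(x_i; \mu_k, \theta_k) = f(\mu_k; x_i, \theta_k) \;\Rightarrow\; \log\left(\frac{M_k}{f(x_i; \mu_k, \theta_k)}\right) = \log\left(\frac{M_k}{f(\mu_k; x_i, \theta_k)}\right). \qquad \square$$


Therefore, the quantity $\log\left(\frac{M_k}{f(x_i; \mu_k, \theta_k)}\right)$ can be used in Eq. (4). 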
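
The three properties can also be checked numerically. The sketch below is a univariate illustration only: it evaluates d(x, mu) = log(M / f(x; mu, theta)) for a Gaussian and a Student-t density, taking M as the density value at the location parameter. The helper names (`gaussian_pdf`, `student_t_pdf`, `density_dissimilarity`) and the parameter values are hypothetical choices for the example.

```python
from math import exp, lgamma, log, pi, sqrt

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density with mean mu and variance sigma2."""
    return exp(-(x - mu) ** 2 / (2 * sigma2)) / sqrt(2 * pi * sigma2)

def student_t_pdf(x, mu, sigma2, nu):
    """Univariate Student-t density with location mu, scale sigma2, nu d.o.f."""
    log_const = lgamma((nu + 1) / 2) - lgamma(nu / 2) - 0.5 * log(pi * nu * sigma2)
    return exp(log_const - (nu + 1) / 2 * log(1 + (x - mu) ** 2 / (nu * sigma2)))

def density_dissimilarity(pdf, x, mu, *theta):
    """d(x, mu) = log(M / f(x; mu, theta)), with M = f(mu; mu, theta)."""
    return log(pdf(mu, mu, *theta) / pdf(x, mu, *theta))

for pdf, theta in [(gaussian_pdf, (2.0,)), (student_t_pdf, (2.0, 5.0))]:
    d_xm = density_dissimilarity(pdf, 1.5, 0.0, *theta)   # d(x, mu)
    d_mx = density_dissimilarity(pdf, 0.0, 1.5, *theta)   # d(mu, x)
    d_mm = density_dissimilarity(pdf, 0.0, 0.0, *theta)   # d(mu, mu)
    # Non-negativity, zero at the location parameter, symmetry.
    print(pdf.__name__, d_xm >= 0, d_mm == 0.0, abs(d_xm - d_mx) < 1e-12)
```

For the Gaussian case the dissimilarity reduces to (x - mu)^2 / (2 sigma^2), i.e. half the squared Mahalanobis distance.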


3.1 Gaussian PD Clustering 


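Gaussian PD clustering is obtained by putting 

$$d_{ik} = \log\left(\frac{M_k}{\phi(x_i; \mu_k, \Sigma_k)}\right)$$

in (4), where $\phi(x_i; \mu_k, \Sigma_k)$ is the probability density function of a multivariate 
Gaussian distribution. In this case the loss function in Eq. (4) becomes 

$$\mathrm{JDF} = \sum_{i=1}^{n} \sum_{k=1}^{K} p_{ik}^{2} \log(M_k) + \frac{1}{2} \sum_{i=1}^{n} \sum_{k=1}^{K} p_{ik}^{2} \log\left((2\pi)^{J} |\Sigma_k|\right) + \frac{1}{2} \sum_{i=1}^{n} \sum_{k=1}^{K} p_{ik}^{2} (x_i - \mu_k)' \Sigma_k^{-1} (x_i - \mu_k). \qquad (6)$$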


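The parameters that minimize the objective function in (6) can be obtained by 
differentiating with respect to $\mu_k$ and $\Sigma_k$. Specifically, at a generic iteration (t+1), 
the parameters that minimize (6) are 

$$\mu_k^{(t+1)} = \frac{\sum_{i=1}^{n} p_{ik}^{2} x_i}{\sum_{i=1}^{n} p_{ik}^{2}}, \qquad (7)$$

$$\Sigma_k^{(t+1)} = \frac{\sum_{i=1}^{n} p_{ik}^{2} \left(x_i - \mu_k^{(t+1)}\right)\left(x_i - \mu_k^{(t+1)}\right)'}{\sum_{i=1}^{n} p_{ik}^{2}}. \qquad (8)$$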
Our iterative algorithm can be summarized as follows: 


1. Random initialization of $\mu_k$ and initialization of $\Sigma_k$ as identity matrix, for i = 1, ..., n and k = 1, ..., K; 
2. update $p_{ik}$ according to (2); 
3. update $\mu_k$ according to (7); 
4. update $\Sigma_k$ according to (8); 
5. if $\mu_k$ changed go to Step 2, otherwise stop. 

In its parametric formulation for Gaussian (and Student-t) distributions, the JDF 
depends on both $\mu_k$ and $\Sigma_k$ (with k = 1, ..., K), and the quantity in (6) can 
be minimized for given values of $\Sigma_k$; therefore, the convergence of the JDF to 
a minimum is not guaranteed. The same occurs for the PD clustering adjusted 
for cluster size algorithm [11], where the authors demonstrate that a convenient 
stopping rule is based on the stability of the solutions. In Step 5, the algorithm stops 
when the difference in $\mu_k$ from the solution of the previous step is negligible. It is worth 
noting that the quantity in (6) can be written as $\mathrm{JDF} = \sum_{i=1}^{n} \sum_{k=1}^{K} p_{ik}^{2} \left(\log(M_k) - 
\log \phi(x_i; \mu_k, \Sigma_k)\right)$ with $M_k \geq \phi(x_i; \mu_k, \Sigma_k)$; therefore, for every k = 1, ..., K 
the function is upper-bounded for non-degenerate density functions. 
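
Steps 1-5 can be sketched in a few lines of Python for the univariate case (J = 1), where the covariance matrix reduces to a variance and d_ik = log(M_k / phi) simplifies to (x_i - mu_k)^2 / (2 sigma_k^2). This is a simplified illustration, not the authors' implementation: the function names, the deterministic starting centers (used instead of random ones so that the run is reproducible), the small constant guarding against zero distances or variances, and the iteration cap are all assumptions of the sketch.

```python
from math import prod

def memberships(distances, eps=1e-12):
    """Eq. (2) for one point; eps guards against exactly zero distances."""
    d = [max(v, eps) for v in distances]
    loo = [prod(v for m, v in enumerate(d) if m != k) for k in range(len(d))]
    total = sum(loo)
    return [v / total for v in loo]

def gaussian_pd_clustering_1d(x, mu_init, tol=1e-8, max_iter=200):
    """Univariate sketch of Steps 1-5 (J = 1, so Sigma_k is a variance)."""
    mu = list(mu_init)
    var = [1.0] * len(mu)                      # Step 1: unit variances
    for _ in range(max_iter):
        # Step 2: d_ik = (x_i - mu_k)^2 / (2 var_k), then Eq. (2).
        p = [memberships([(xi - m) ** 2 / (2 * v) for m, v in zip(mu, var)])
             for xi in x]
        new_mu, new_var = [], []
        for k in range(len(mu)):
            w = [row[k] ** 2 for row in p]     # squared memberships
            m_k = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)             # Eq. (7)
            v_k = sum(wi * (xi - m_k) ** 2 for wi, xi in zip(w, x)) / sum(w)  # Eq. (8)
            new_mu.append(m_k)
            new_var.append(max(v_k, 1e-12))    # avoid a degenerate variance
        shift = max(abs(a - b) for a, b in zip(mu, new_mu))
        mu, var = new_mu, new_var
        if shift < tol:                        # Step 5: centers are stable
            break
    return mu, var, p

x = [-1.0, -0.5, 0.0, 0.5, 1.0, 9.0, 9.5, 10.0, 10.5, 11.0]
mu, var, p = gaussian_pd_clustering_1d(x, mu_init=[-1.0, 11.0])
labels = [max(range(len(mu)), key=lambda k: row[k]) for row in p]
print(labels)
```

The stopping rule follows Step 5: the loop ends when the largest change in the centers falls below `tol`, or after `max_iter` iterations since convergence of the JDF is not guaranteed.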


3.2 Student-t PD Clustering 


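Based on the same idea, we derived a Student-t PD clustering. In (6) the density 
function of a multivariate Gaussian distribution is replaced by the density function 
of a multivariate Student-t distribution, 

$$f(x; \mu, \Sigma, \nu) = \frac{\Gamma\left(\frac{\nu + J}{2}\right) |\Sigma|^{-1/2}}{(\pi \nu)^{J/2}\, \Gamma\left(\frac{\nu}{2}\right) \left\{1 + \frac{\delta(x, \mu, \Sigma)}{\nu}\right\}^{(\nu + J)/2}}, \qquad (9)$$

where $\delta(x, \mu, \Sigma) = (x - \mu)' \Sigma^{-1} (x - \mu)$. The JDF becomes 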





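$$\mathrm{JDF} = \sum_{i=1}^{n} \sum_{k=1}^{K} p_{ik}^{2} \left[ \log(M_k) - \log \frac{\Gamma\left(\frac{\nu_k + J}{2}\right) |\Sigma_k|^{-1/2}}{(\pi \nu_k)^{J/2}\, \Gamma\left(\frac{\nu_k}{2}\right)} \right] + \sum_{i=1}^{n} \sum_{k=1}^{K} p_{ik}^{2}\, \frac{\nu_k + J}{2} \log\left(1 + \frac{\delta(x_i, \mu_k, \Sigma_k)}{\nu_k}\right). \qquad (10)$$
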
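The parameters that optimize Eq. (10) can be found by differentiating with respect 
to $\mu_k$, $\Sigma_k$, and $\nu_k$. Specifically, at a generic iteration (t+1), the parameters that 
minimize (10) are 

$$\mu_k^{(t+1)} = \frac{\sum_{i=1}^{n} w_{ik} x_i}{\sum_{i=1}^{n} w_{ik}}, \qquad (11)$$

$$\Sigma_k^{(t+1)} = \frac{\sum_{i=1}^{n} w_{ik} \left(x_i - \mu_k^{(t+1)}\right)\left(x_i - \mu_k^{(t+1)}\right)'}{\sum_{i=1}^{n} p_{ik}^{2}}, \qquad (12)$$

where 

$$w_{ik} = p_{ik}^{2}\, \frac{\nu_k^{(t)} + J}{\nu_k^{(t)} + \delta\left(x_i, \mu_k^{(t)}, \Sigma_k^{(t)}\right)}.$$
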
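The degrees of freedom $\nu_k^{(t+1)}$ are the solution to the following equation: 

$$\sum_{i=1}^{n} p_{ik}^{2} \left[ \psi\left(\frac{\nu_k}{2}\right) - \psi\left(\frac{\nu_k + J}{2}\right) + \frac{J}{\nu_k} + \log\left(1 + \frac{\delta_{ik}}{\nu_k}\right) \right] - \frac{\nu_k + J}{\nu_k} \sum_{i=1}^{n} p_{ik}^{2}\, \frac{\delta_{ik}}{\nu_k + \delta_{ik}} = 0, \qquad (13)$$

where $\delta_{ik} = \delta\left(x_i, \mu_k^{(t+1)}, \Sigma_k^{(t+1)}\right)$ and $\psi(\nu) = \frac{\partial \Gamma(\nu) / \partial \nu}{\Gamma(\nu)}$. 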
Our iterative algorithm can be summarized as follows: 


1. Random initialization of $\mu_k$, initialization of $\Sigma_k$ as identity matrix, and $\nu_k = 20$, for i = 1, ..., n and k = 1, ..., K; 
2. update $p_{ik}$ according to (2); 
3. update $\mu_k$ according to (11); 
4. update $\Sigma_k$ according to (12); 
5. update $\nu_k$ solving (13); 
6. if $\mu_k$ changed, then go to Step 2, otherwise stop. 
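
To illustrate why a Student-t density makes the center update more robust, the following univariate Python sketch compares a mean update weighted by the squared memberships, as in the Gaussian case, with one weighted by w_ik = p_ik^2 (nu_k + J) / (nu_k + delta_ik). This form of the weights is assumed here by analogy with the usual Student-t estimating equations; the function names and the toy data are hypothetical.

```python
def t_weights(p_col, x, mu, var, nu, J=1):
    """Assumed weights w_ik = p_ik^2 (nu + J) / (nu + delta_ik) for one cluster.

    delta_ik is the squared Mahalanobis distance, here (x_i - mu)^2 / var.
    """
    return [pk ** 2 * (nu + J) / (nu + (xi - mu) ** 2 / var)
            for pk, xi in zip(p_col, x)]

def weighted_mean(w, x):
    """Weighted average sum_i w_i x_i / sum_i w_i."""
    return sum(wi * xi for wi, xi in zip(w, x)) / sum(w)

x = [-1.0, 0.0, 1.0, 8.0]          # the last point is an outlier
p_col = [1.0, 1.0, 1.0, 1.0]       # all points fully assigned to this cluster

gaussian_update = weighted_mean([pk ** 2 for pk in p_col], x)
t_update = weighted_mean(t_weights(p_col, x, mu=0.0, var=1.0, nu=5.0), x)
print(gaussian_update, t_update)
```

With these inputs the outlier at 8 receives a weight of about 0.09 instead of 1, so the Student-t update stays much closer to the bulk of the data than the Gaussian-type update, which equals 2.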


4 Application on Simulated Data Sets 


Gaussian PD clustering (GPDC), Student-t PD clustering (TPDC), and standard PD 
clustering (PDC) algorithms have been compared on two simulated scenarios. For 
each scenario we generated 100 data sets with K = 2, J = 2, n = 900 ($n_1$ = 400, 
$n_2$ = 500), and we used the following parameters: 

$$\mu_1 = (0, 0), \quad \mu_2 = (2, 4), \quad \Sigma_1 = \begin{pmatrix} 1 & -0.5 \\ -0.5 & \ldots \end{pmatrix}, \quad \text{and} \quad \Sigma_2 = \begin{pmatrix} 1 & 0.5 \\ 0.5 & \ldots \end{pmatrix}.$$

We used the software R [14]; the standard PD clustering algorithm is fitted by the 
function PDclust of the package FPDclustering [17]. For all the algorithms we 
used 5 random starts to find the starting points. 

For scenario a, each cluster has been generated from a multivariate Gaussian 
distribution using the function rmvnorm from the R package mvtnorm [7]. 
For scenario b, each cluster has been generated from a multivariate Student-t 
distribution using the function rmvt from the same R package. Table 1 shows 
the true parameters compared with the average and the standard deviation of the 
estimated parameters. In Table 1, $\sigma_{kij}$ refers to the elements of $\Sigma_k$. In scenario 
a, all the methods give good estimates of the cluster means, and the parameters 
obtained with TPDC are the closest to the true parameters. PDC does not estimate 
the covariance matrices of the clusters; both TPDC and GPDC give good estimates 
for the covariance matrices. It is worth noticing that the TPDC estimates for $\sigma$ are lower 
than the true parameters and lower than the estimates obtained with GPDC. This 
can be explained by the estimates of the degrees of freedom, which are smaller than 
the true values. In scenario b, all the methods give good estimates of the means; 
however, the estimates for the covariance matrices obtained using TPDC are closer 
to the true value when compared to GPDC. TPDC has an extra parameter compared 
to GPDC, the degrees of freedom, and this explains the higher variance for GPDC. 
The standard deviation (SD) of the mean estimates is similar for the three methods; 
the SD of the variance and covariance estimates for GPDC is slightly higher, and 
this difference decreases as the cluster dimensions increase. 

To compare the clustering performance we use the adjusted Rand index (ARI) 
[9]. The ARI compares the predicted classification with the true one and corrects the Rand 
index [15] for chance; its expected value under random classification is 0, and 
it takes a value of 1 when there is perfect class agreement. In both scenarios 
the clusters overlap (see Fig. 1); nevertheless, all the methods detect the clustering 
structure. Table 2 shows the average ARI values; the lowest ARI is 0.92 for scenario 
a and 0.86 for scenario b. TPDC outperforms GPDC and PDC in both scenarios 
with an ARI of 0.96 and 0.90, respectively. This is expected because TPDC is the 
most flexible method. The lower ARI of PD clustering is explained by its constant 
covariance structure. In Table 2 we also compare the proposed methods with k-means, 
Gaussian mixture models (GMM), and Student-t mixture models (TMM). 
For this comparison we used the R functions kmeans, gpcm (option “VVV”) from the 
package mixture [4], and teigen (option “UUUU”) from the package teigen [1], 
respectively. 
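
For readers who want to reproduce this kind of comparison, the ARI of Hubert and Arabie [9] can be computed from the contingency table of the two partitions. The Python sketch below uses only the standard library; the function name `adjusted_rand_index` is illustrative, and the value returned when the index is undefined (for example, both partitions consist of a single cluster) is a convention chosen for the sketch. At least two observations are assumed.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same n >= 2 units."""
    n = len(labels_true)
    # Contingency table cells n_ij and its margins, stored as counts.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)   # expected index under chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:                     # index undefined: convention
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # -0.5
```

The first call shows that the index is invariant to a relabelling of the clusters; the second shows that it can be negative when agreement is worse than expected by chance.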


Fig. 1 Example of data sets 
generated using a mixture of 
multivariate Gaussian 
distributions (top) and a 
mixture of multivariate 
Student-t distributions with 8 
degrees of freedom (bottom) 
Table 2 Mean ARI and standard deviation (SD) on 100 data sets 


             GPDC         TPDC         PDC          k-means      GMM          TMM 
             Mean  SD     Mean  SD     Mean  SD     Mean  SD     Mean  SD     Mean  SD 
Scenario a   0.95  0.02   0.96  0.01   0.92  0.02   0.92  0.02   0.96  0.01   0.96  0.01 
Scenario b   0.89  0.03   0.90  0.02   0.87  0.02   0.87  0.02   0.90  0.02   0.90  0.02 


PD clustering and k-means give the same ARI in both scenarios. The data sets 
in scenarios a and b have been generated as mixtures of multivariate Gaussian and 
multivariate Student-t distributions, respectively; therefore, as expected, GMM and 
TMM give the best performance, together with the proposed TPDC algorithm. 


5 Conclusion and Future Work 


In probabilistic distance (PD) clustering, given some random centers, the probability 
of a point belonging to a cluster is assumed to be inversely proportional to 
the distance from the center of that cluster. The algorithm performs very well; 
however, it shows limitations when clusters have different variances or when variables 
are correlated. We proposed a parameterized version of the algorithm using a 
dissimilarity measure based on a probability density function. Specifically, we 
used the Gaussian and the Student-t distributions, which allows us to overcome the 
mentioned issues. Using simulated data sets, we showed that both algorithms can 
correctly estimate the parameters of the population and perform well in 
terms of ARI. The same algorithm can be extended to other distributions; moreover, 
Iyigun and Ben-Israel [11] proposed the PD clustering algorithm adjusted for cluster 
size. A natural development of this work is to derive a parametric version of the PD 
clustering algorithm adjusted for cluster size and to compare it with the 
expectation-maximization algorithm [5]. 


Acknowledgements The authors are very grateful to the two anonymous referees for their detailed 
and helpful comments to finalize the manuscript. 


References 


1. Andrews, J.L., Wickins, J.R., Boers, N.M., McNicholas, P.D.: teigen: an R package for model-based clustering and classification via the multivariate t distribution. J. Stat. Softw. 83, 1-32 (2017) 
2. Ben-Israel, A., Iyigun, C.: Probabilistic d-clustering. J. Classif. 25, 5-26 (2008) 
3. Bezdek, J.C., Ehrlich, R., Full, W.: FCM: the fuzzy c-means clustering algorithm. Comput. Geosci. 10, 191-203 (1984) 
4. Browne, R.P., ElSherbiny, A., McNicholas, P.D.: mixture: Mixture Models for Clustering and Classification. R package version 1.4 (2015). https://cran.r-project.org/web/packages/mixture/index.html 
5. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39, 1-38 (1977) 
6. Everitt, B.S., Landau, S., Leese, M., Stahl, D.: Cluster Analysis. Wiley Series in Probability and Statistics. Wiley, New York (2011) 
7. Genz, A., Bretz, F., Miwa, T., Mi, X., Leisch, F., Scheipl, F., Hothorn, T.: mvtnorm: multivariate normal and t distributions. R package version 1.0-7 (2009). https://cran.r-project.org/web/packages/mvtnorm/index.html 
8. Gordon, A.D.: Classification, 2nd edn. Chapman and Hall/CRC, Boca Raton (1999) 
9. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2, 193-218 (1985) 
10. Iyigun, C.: Probabilistic distance clustering. Ph.D. thesis, State University of New Jersey (2007) 
11. Iyigun, C., Ben-Israel, A.: Probabilistic distance clustering adjusted for cluster size. Probab. Eng. Inform. Sci. 22, 68-125 (2008) 
12. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium, vol. 1, pp. 281-297 (1967) 
13. McLachlan, G.J., Peel, D.: Finite Mixture Models. Wiley Interscience, New York (2000) 
14. R Core Team: R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna (2016) 
15. Rand, W.M.: Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66, 846-850 (1971) 
16. Theodoridis, S., Koutroumbas, K.: Pattern Recognition, 2nd edn. Academic Press, New York (2003) 
17. Tortora, C., McNicholas, P.D.: FPDclustering: PD-clustering and factor PD-clustering. R package version 1.1 (2016). https://cran.r-project.org/web/packages/FPDclustering/index.html 
18. Tortora, C., Gettler-Summa, M., Marino, M., Palumbo, F.: Factor probabilistic distance clustering (FPDC): a new clustering method. Adv. Data Anal. Classif. 10, 441-464 (2016) 


An Overview on the URV Model-Based 
Approach to Cluster Mixed-Type Data 


Monia Ranalli and Roberto Rocci 


Abstract In this paper, we provide an overview on the underlying response variable 
(URV) model-based approach to cluster and, optionally, simultaneously reduce 
ordinal and, optionally, continuous variables. We summarize and compare its main 
features discussing some key issues. An example of application to real data is 
illustrated comparing and discussing clustering performances. 


Keywords URV · Finite mixture models · Ordinal data · Composite likelihood 


1 Introduction 


A frequently used clustering model is the finite mixture of Gaussians (FMG) [15], 

$$f(y; \psi) = \sum_{g=1}^{G} p_g\, \phi_P(y; \mu_g, \Sigma_g), \qquad (1)$$

where $\phi_P(y; \mu_g, \Sigma_g)$ is the P-variate Gaussian density with mean $\mu_g$ and 
covariance matrix $\Sigma_g$, and $p_1, p_2, \ldots, p_G$ is the set of positive weights that sum to 
1. Usually each Gaussian density (component) is interpreted as a cluster 
(sub-population) and the corresponding weight as the probability that an observation 
comes from it. FMG works on continuous variables, but some issues arise on ranked 
data due to the lack of metric properties: the category scores are arbitrary and the 
assumption of normality is not true anymore. To analyse ordinal data two main 
approaches exist: item response theory (IRT, [1]) and underlying response variable 
(URV, [19]). In the first one, the ordinal variables are assumed to be independent 
given a set of latent continuous variables that have a clustering structure (for 
example, they can be distributed as a FMG [4]). On the other hand, URV is a 
way to overcome the within-independence limitation: the observed variables are 
a categorization of underlying non-observable continuous variables distributed as 
a FMG (see, for example, [8, 12, 20, 22, 23]). This has been extended in several 
ways. Everitt [8] introduces a mixture model for mixed data. The joint distribution 
of the variables is a homoscedastic FMG where some variables are observed as 
ordinal. In particular, the ordinal variables are seen as generated by thresholding 
some marginals of the joint FMG with different thresholds in each component. 
The model proposed by Lubke and Neale [12] is specified for ordinal variables 
that are generated by thresholding a heteroscedastic mixture of Gaussians, whose 
covariance matrices are reparametrized as a factor analysis model. Nevertheless, 
in both cases the estimation of the model by maximum likelihood requires the 
numerical computation of multidimensional integrals, which is time consuming. For 
computational reasons, these models can include only a few ordinal variables. In the sequel we 
summarize the main results obtained by adopting a composite likelihood approach. 
We first present the model with only ordinal variables [20], then we extend it to the 
case where noise dimensions or variables are present [23] and finally we generalize 
the proposal to mixed-type data [22]. 


2 Clustering Ordinal Data 


We start by describing the key features of the proposal of [20]. It aims at capturing 
the cluster structure underlying the data without requiring the local independence 
assumption, which may be too restrictive in practice. Let $x_1, \ldots, x_P$ be ordinal 
variables and $c_i = 1, \ldots, C_i$ the associated categories for i = 1, ..., P. There 
are $R = \prod_{i=1}^{P} C_i$ possible response patterns $x_r = (x_1 = c_1, \ldots, x_P = c_P)$, 
with r = 1, ..., R. The ordinal variables are generated by thresholding y, which is a 
multivariate continuous random variable distributed as a FMG (1). The link between 
x and y is expressed by a threshold model defined as $x_i = c_i \Leftrightarrow \gamma^{(i)}_{c_i - 1} \leq y_i < 
\gamma^{(i)}_{c_i}$. Let $\psi = \{p_1, \ldots, p_{G-1}, \mu_1, \ldots, \mu_G, \Sigma_1, \ldots, \Sigma_G, \Gamma\}$ be the set of model 
parameters, where $\Gamma$ is the set of vectors $\gamma^{(i)}$. The probability of response pattern 
$x_r$ is 

$$\Pr(x_r; \psi) = \sum_{g=1}^{G} p_g \int_{\gamma^{(1)}_{c_1 - 1}}^{\gamma^{(1)}_{c_1}} \cdots \int_{\gamma^{(P)}_{c_P - 1}}^{\gamma^{(P)}_{c_P}} \phi_P(y; \mu_g, \Sigma_g)\, dy = \sum_{g=1}^{G} p_g\, \pi_r(\mu_g, \Sigma_g, \Gamma), \qquad (2)$$

where $\pi_r(\mu_g, \Sigma_g, \Gamma)$ is the probability of response pattern $x_r$ in cluster g. Thus, for 
a random i.i.d. sample of size N the log-likelihood is 

$$\ell(\psi; X) = \sum_{r=1}^{R} n_r \log\left[\sum_{g=1}^{G} p_g\, \pi_r(\mu_g, \Sigma_g, \Gamma)\right], \qquad (3)$$

where $n_r$ is the observed sample frequency of response pattern $x_r$ and $\sum_{r=1}^{R} n_r = 
N$. The maximization of (3) is quite time consuming and becomes infeasible 
when P increases due to the presence of multidimensional integrals. For this 
reason, model parameters are estimated through an EM framework maximizing 
the pairwise log-likelihood, i.e. the sum of all possible log-likelihoods based on 
the bivariate marginals. The estimators obtained have been proven to be consistent, 
asymptotically unbiased and normally distributed. In general they are less efficient 
than the full maximum likelihood estimators, but in many cases the loss in efficiency 
is very small or almost null [25]. 
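
The threshold model and the size of the problem can be made concrete with a small Python sketch. It discretizes a latent vector y into ordinal categories and counts the response patterns R and the bivariate marginals entering a pairwise likelihood. The function name `discretize` and the threshold values are hypothetical; the outer thresholds are taken to be minus and plus infinity, which is an assumption of the sketch.

```python
from bisect import bisect_right
from itertools import combinations
from math import prod

def discretize(y, thresholds):
    """Map latent values y_i to categories c_i = 1, ..., C_i.

    thresholds[i] holds the C_i - 1 finite, increasing cut points of variable i,
    so that x_i = c  <=>  gamma_{c-1} <= y_i < gamma_c.
    """
    return tuple(bisect_right(cuts, yi) + 1 for yi, cuts in zip(y, thresholds))

# Three ordinal variables with 3, 2 and 4 categories (illustrative cut points).
thresholds = [[-0.5, 0.5], [0.0], [-1.0, 0.0, 1.0]]
n_categories = [len(cuts) + 1 for cuts in thresholds]

R = prod(n_categories)                                  # response patterns
pairs = list(combinations(range(len(thresholds)), 2))   # bivariate marginals

print(discretize([0.2, -0.3, 1.7], thresholds), R, len(pairs))  # (2, 1, 4) 24 3
```

The full likelihood involves a P-dimensional integral for each observed response pattern, whereas the pairwise likelihood only needs two-dimensional integrals for the P(P - 1)/2 pairs of variables.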


3 Simultaneous Clustering and Reduction 


In order to identify the discriminative dimensions in the previously described 
model, Ranalli and Rocci [23] assumed that there is a second-order set of P latent 
variables, say factors, $\tilde{y}$, formed of two independent subsets. In the first one, there 
are Q (with Q < P) factors that have some clustering information, defining the 
so-called discriminative dimensions. In the second set there are $\bar{Q} = P - Q$ noise 
factors, i.e. noise dimensions. Technically, the Q informative elements of $\tilde{y}$ are 
assumed to be distributed as a mixture of Gaussians with class conditional means 
and variances equal to $E(\tilde{y}^{Q} \mid g) = \eta_g$ and $\mathrm{Cov}(\tilde{y}^{Q} \mid g) = \Omega_g$, respectively. The 
$\bar{Q}$ noisy elements do not contain information about the cluster structure; it follows 
that they are independent of $\tilde{y}^{Q}$ and their distribution does not vary from one class 
to another: $E(\tilde{y}^{\bar{Q}} \mid g) = \eta_0$, $\mathrm{Cov}(\tilde{y}^{\bar{Q}} \mid g) = \Omega_0$. The link between the latent 
variables and the factors is given by $y = A\tilde{y}$, where A is non-singular. The variables 
y that are most correlated with the factors $\tilde{y}^{\bar{Q}}$ are identified as noise. It is also worth 
noticing that, exploiting the independence between $\tilde{y}^{Q}$ and $\tilde{y}^{\bar{Q}}$, it is possible to 
compute the proportion of each latent variable's variance that can be explained by the 
noise factors and, by complement, by the discriminative factors. These proportions are very 
helpful in identifying the noise variables. For more details, see [23]. 


4 Clustering Mixed-Type Data 


Finally, we summarize the extension of [20] to the mixed-type data case [22] 
(called “HetMixtureMixed”). Let $x = [x_1, \ldots, x_Q]'$ and $y^{\bar{Q}} = [y_{Q+1}, \ldots, y_P]'$ 
be Q ordinal and $\bar{Q} = P - Q$ continuous variables, respectively. Under the 
URV, the ordinal variables x are considered as a discretization of a continuous 
multivariate latent variable $y^{Q} = [y_1, \ldots, y_Q]'$. To accommodate both cluster 
structure and dependence within the groups, we assume that $y = [y^{Q\prime}, y^{\bar{Q}\prime}]'$ follows 
the heteroscedastic Gaussian mixture (1). For a random i.i.d. sample of size N, 
$(x_1, y_1^{\bar{Q}}), \ldots, (x_N, y_N^{\bar{Q}})$, the log-likelihood is 

$$\ell(\psi) = \sum_{n=1}^{N} \log\left[\sum_{g=1}^{G} p_g\, \phi_{\bar{Q}}\left(y_n^{\bar{Q}}; \mu_g^{\bar{Q}}, \Sigma_g^{\bar{Q}\bar{Q}}\right) \pi_n\left(\mu_{n;g}^{Q \mid \bar{Q}}, \Sigma_g^{Q \mid \bar{Q}}, \Gamma\right)\right],$$

where $\pi_n\left(\mu_{n;g}^{Q \mid \bar{Q}}, \Sigma_g^{Q \mid \bar{Q}}, \Gamma\right)$ is the 
conditional joint probability of response pattern $x_n = (c_1^{(1)}, \ldots, c_Q^{(Q)})$ given the 
cluster g and the continuous variables $y_n^{\bar{Q}}$. To overcome the computational issues 
caused by the presence of multidimensional integrals in the likelihood, a composite 
likelihood is used, composed of three block-estimating functions: the full likelihood 
of a FMG for the continuous variables, the pairwise likelihood of a latent mixture 
of Gaussians for the ordinal variables (Q(Q - 1)/2 sub-likelihoods) and the Q 
likelihoods of one ordinal variable and all continuous variables. For more details, 
see [22]. 


5 Model Identifiability 


The combination of ordinal data and composite likelihood requires specific attention 
to identifiability issues. Composite likelihood estimation methods provide good 
estimators as long as the model is identified, i.e. if the composite likelihood is 
rich enough to include all the information about the parameters [14]. In other 
words, the marginals involved in the composite likelihood should be able to capture 
and identify the true cluster structure underlying the data. For example, a model 
that is identified may no longer be identified when only the bivariate marginals are considered. 
We illustrate this aspect through an example for continuous data. The first row of 
Fig. 1 displays two different cluster structures, both generated from a tri-variate 
homoscedastic FMG with four equally weighted components. The only difference 
is given by the centroids of the clusters. The remaining rows show that the same 
bivariate marginals correspond to two different configurations of four clusters. It 
follows that, in some cases, it is not possible to identify the true cluster structure 
by looking at only the bivariate marginals. However, we note that in the previous 
example the non-identifiability is due to the perfect overlapping of the centroids of 
the clusters on the marginals, in addition to the same covariance matrix and mixture 
weights. In practice, these conditions are strict and very unlikely because they are 
based on several equalities of parameters. Summing up, we believe that situations of 
non-identifiability are rare in practice, especially with a large number of variables, 
and that the real problem is the case where the model is weakly identifiable (see [23] 
for further details). In such cases, it is recommended to use marginals of higher order. 
This increases the efficiency of the composite estimators and improves the 
model identifiability. Specific details and some necessary/sufficient conditions can 
be found in [20, 22, 23]. 


Fig. 1 Simulated example where true cluster structure can be captured only with m = 3. Two 
different cluster structures (first row) lead to the same bivariate marginals (last three rows) 


6 Computation, Classification and Model Selection 


All the models summarized above are estimated within the expectation-maximization 
(EM) framework by maximizing a composite log-likelihood. As regards 
classification, in the context of finite mixture models estimated through a 
full likelihood, an observation is assigned to the component with the maximum 
a posteriori probability (MAP criterion). However, when we adopt a composite 
likelihood approach, this is no longer possible, since we do not compute the 
joint density for each observation. There are at least two 
solutions to this problem [22, 23]. In the first, the MAP criterion is used, with the 
joint probabilities estimated by evaluating the multidimensional integrals at 
the composite likelihood estimates. In the second, we note that the MAP criterion assigns 
an observation to the component with the maximum scaled fit (scaled by the 
corresponding mixing weight). Similarly, in the composite likelihood framework, 
the (scaled) composite fit of an observation is evaluated on each component and 
the observation is assigned to the component with the maximum (scaled) fit (CMAP 
criterion). In the first case, multidimensional integrals are still needed, 
but they have to be evaluated only once rather than many times (as required in the estimation). 
However, CMAP is computationally more efficient, with competitive 
performance, as shown in [22]. Finally, the best model is chosen by minimizing the 
composite version of penalized likelihood selection criteria like BIC or CLC (see 
[21] and the references therein). 
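
The "maximum scaled fit" reading of the MAP criterion can be illustrated with a short Python sketch. It assigns an observation to the component maximizing log(p_g) plus a log-fit value; here the fit is the full univariate Gaussian log-density, which gives the usual MAP rule. Under CMAP the log-fit would instead be a composite fit, whose exact construction is given in [22] and is not reproduced here. The function names and the mixture parameters are illustrative assumptions.

```python
from math import log, pi

def gaussian_logpdf(y, mu, var):
    """Log-density of a univariate Gaussian with mean mu and variance var."""
    return -0.5 * (log(2 * pi * var) + (y - mu) ** 2 / var)

def assign(weights, log_fits):
    """Index of the component maximizing log(p_g) + log-fit_g."""
    scores = [log(w) + lf for w, lf in zip(weights, log_fits)]
    return max(range(len(scores)), key=scores.__getitem__)

weights = [0.7, 0.3]                 # mixing weights p_g (illustrative)
params = [(0.0, 1.0), (3.0, 1.0)]    # (mean, variance) of each component

for y in (0.5, 1.9, 2.5):
    log_fits = [gaussian_logpdf(y, m, v) for m, v in params]
    print(y, assign(weights, log_fits))
```

Because the first component has the larger weight, the decision boundary is shifted from the midpoint 1.5 towards the second component (to about 1.78 with these values).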


7 Some Related Models 


The aforementioned models could be seen as an extension, with some modifications, 
of Everitt’s proposal [8] into the composite estimation framework. It would be 
also interesting to make such extension for the proposal of [12]. On the other 
hand, the proposal of [23] can be compared to variable selection and parsimonious 
modelling. Variable selection (see, e.g. [7, 27] for categorical data) is commonly 
based on heuristic methods that are computationally demanding. It assumes that 
only noise variables may exist; the existence of noise dimensions is not assumed. 
The proposal of [23] can be used to understand how informative a variable is 
for the classification. As regards parsimonious modelling, examples in the context 
of continuous data are [6], mixtures of 
factor analysers (see, e.g. [16, 17]) and mixtures of principal component analysers 
(see, e.g. [24]). See [2] for a recent review on model-based clustering of 
high-dimensional data. As regards categorical data, we find few analogous proposals 
(see, e.g. [9, 12, 13]). All the aforementioned proposals use a variable reduction 
model, like factor analysis or principal component analysis, to reparametrize the 
component covariance matrices. They do not aim at identifying the informative 
dimensions. Their dimensionality reduction is only local and within the components. 
In contrast, [23] can be used to cluster observations by taking into account the 
presence of global informative/noise dimensions. Finally, [22] can be compared with 
some existing proposals on clustering mixed-type data. Lawrence and Krzanowski 
[11] introduced a location mixture model according to which for the continuous 
variables, a Gaussian mixture exists, whose component mean vectors depend on the 
specific combination of categories (i.e. response pattern) assumed by the categorical 
ones. However, it is not identifiable without imposing some constraints on the mean 
parameters of the Gaussian distributions [28]. Furthermore each combination of 
categories identifies a set of clusters: it follows that the total number of clusters 
can be unnecessarily large. A more parsimonious, but less realistic, model is given 
by Hunt and Jorgensen [10], according to which the variables are decomposed 
into conditionally independent blocks containing a set of continuous variables or 
one categorical variable. A more general model can be obtained by exploiting the 
IRT approach. The observed variables, continuous or ordinal, are assumed to be 
independent given a set of continuous latent variables that have a clustering structure 
described by a FMG [3, 5]. Finally, by relaxing the local independence assumption, 
Morlini [18] proposes a model-based clustering for mixed binary and continuous 
variables. The estimation is carried out in two steps using the software LATENT 
GOLD [26]. Unlike the location mixture model proposed by Lawrence 
and Krzanowski [11] and the model proposed by Hunt and Jorgensen [10], [22] makes 
no local independence or conditionally independent blocks assumption. 
Unlike [3], in [22] the dependencies between variables, both within and 
between groups, can be easily measured. Unlike [18], in [22] the parameter 
estimates are obtained simultaneously. 


8 Real Data Application 


The data consist of 1599 Portuguese “Vinho Verde” red wines described by 
eleven continuous physicochemical variables (fixed acidity, volatile acidity, citric 
acidity, residual sugar, chlorides, free sulphur dioxide, total sulphur dioxide, density, 
pH, sulphates, alcohol) and one ordinal variable (quality score). Different models 
have been compared: the FMG for all data (naive approach, treating the ordinal 
variable as continuous), the latent FMG for the ordinal variable only, and the model 
illustrated in Sect. 4 (HetMixtureMixed). All models have been fitted with different 
numbers of groups, G = 2, 3, 4. The cCLC (144,990) selects G = 3 as the best 
solution for the HetMixtureMixed model, compared to 181,800 and 154,660 for 
G = 2 and G = 4, respectively. The BIC selects G = 2 as the best solution for both the 
FMG on all data (naive approach) and the latent FMG on the ordinal variable only. 
For the naive approach the BIC is 16,547, compared to 19,185 and 19,645 for G = 3 
and G = 4, respectively. For the latter the BIC is 16,547 compared to 19,185 and 
19,645 for G = 3 and G = 4, respectively. The difference in G across the three 
models can be explained as follows: the presence of the ordinal variable tends to guide 
the choice of the number of clusters. Indeed, looking at the relative frequency distribution 
of the ordinal variable, the number of groups tends to coincide with the number 
of most frequent categories (the quality variable takes values k = 1,..., 6 with 
frequencies 0.01, 0.03, 0.43, 0.40, 0.12, 0.01, respectively). Furthermore, 
the adjusted Rand index obtained by comparing the partition fitted by the 
latent FMG for the ordinal variable only with the partition obtained by assigning 
each wine label 1 if its quality score is 1, 2 or 3 and label 2 otherwise is equal to 0.8485. It follows 
that fitting the latent FMG for mixed-type data (and thus properly taking into account 
the nature of ordinal variables) mitigates the effect of the ranks on the 
clusters. The three groups fitted by the latent FMG for mixed-type data represent 
different levels of wine quality: high quality (p1 = 0.61), medium quality (p2 = 
0.22) and low quality (p3 = 0.17). The high quality wine group is characterized 
mainly by lower levels of acidity, pH, chlorides and sulphites (these levels 
increase as the wine quality decreases). It shows a high correlation between the two 
sulphur measures, as opposed to a high correlation between the density and acidity 
measures. The low quality wine group takes larger values for both sulphur dioxide 
measures and the alcohol content. In this class, the wine quality is correlated with 
a high alcohol content and small values for the chlorides and acidity measures. 
The second and the third group present similar features; the main difference is 
the total sulphur dioxide, which is twice as large in group 2 as in group 3. 
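
The partition comparison reported above relies on the adjusted Rand index. As a minimal sketch of how such a comparison can be computed, the following Python function implements the usual pair-counting formula; the quality scores and fitted labels are hypothetical toy values, not the wine data, and the function name is our own.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same units."""
    if len(labels_a) != len(labels_b) or len(labels_a) < 2:
        raise ValueError("need two labelings of the same (at least two) units")
    n = len(labels_a)
    # Contingency counts n_ij and the two sets of marginal counts.
    cells = Counter(zip(labels_a, labels_b))
    rows = Counter(labels_a)
    cols = Counter(labels_b)
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:  # degenerate case, e.g. both partitions trivial
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


# Hypothetical quality scores (1..6) and hypothetical fitted cluster labels.
quality = [3, 3, 4, 4, 2, 5, 3, 4, 6, 1]
fitted = [1, 1, 2, 2, 1, 2, 2, 2, 2, 1]
# Reference partition as described in the text: label 1 if quality <= 3, else 2.
reference = [1 if q <= 3 else 2 for q in quality]
ari = adjusted_rand_index(fitted, reference)
```

The index equals 1 when the two partitions coincide up to a relabelling of the groups, which is why it is suited to comparing a fitted clustering with a partition derived from the ordinal variable.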


References 


1. Bock, D., Moustaki, I.: Item response theory in a general framework. In: Handbook of Statistics 
on Psychometrics. Elsevier, Amsterdam (2007) 
2. Bouveyron, C., Brunet, C.: Model-based clustering of high-dimensional data: a review. 
Comput. Stat. Data Anal. 71, 52-78 (2012) 
3. Browne, R.P., McNicholas, P.D.: Model-based clustering, classification, and discriminant 
analysis of data with mixed type. J. Stat. Plan. Inference 142(11), 2976-2984 (2012) 
4. Cagnone, S., Viroli, C.: A factor mixture analysis model for multivariate binary data. Stat. 
Model. 12, 257-277 (2012) 
5. Cai, J.H., Song, X.Y., Lam, K.H., Ip, E.H.S.: A mixture of generalized latent variable models 
for mixed mode and heterogeneous data. Comput. Stat. Data Anal. 55(11), 2889-2907 (2011) 
6. Celeux, G., Govaert, G.: Gaussian parsimonious clustering models. Pattern Recognit. 28(5), 
781-793 (1995) 
7. Dean, N., Raftery, A.E.: Latent class analysis variable selection. Ann. Inst. Stat. Math. 62(1), 
11-35 (2010) 
8. Everitt, B.: A finite mixture model for the clustering of mixed-mode data. Stat. Probab. Lett. 
6(5), 305-309 (1988) 
9. Gollini, I., Murphy, T.: Mixture of latent trait analyzers for model-based clustering of 
categorical data. Stat. Comput. 24(4), 569-588 (2014) 
10. Hunt, L., Jorgensen, M.: Clustering mixed data. Wiley Interdiscip. Rev. Data Min. Knowl. 
Discov. 1(4), 352-361 (2011) 
11. Lawrence, C., Krzanowski, W.: Mixture separation for mixed-mode data. Stat. Comput. 6(1), 
85-92 (1996) 
12. Lubke, G., Neale, M.: Distinguishing between latent classes and continuous factors with 
categorical outcomes: class invariance of parameters of factor mixture models. Multivar. Behav. 
Res. 43(4), 592-620 (2008) 


13. Marbac, M., Biernacki, C., Vandewalle, V.: Finite mixture model of conditional dependencies 
modes to cluster categorical data (2014, preprint). arXiv:1402.5103 
14. Mardia, K.V., Kent, J.T., Hughes, G., Taylor, C.C.: Maximum likelihood estimation using 
composite likelihoods for closed exponential families. Biometrika 96(4), 975-982 (2009) 
15. McLachlan, G.J., Rathnayake, S.I.: Mixture models for standard p-dimensional Euclidean data. 
In: Hennig, C., Meila, M., Murtagh, F., Rocci, R. (eds.) Handbook of Cluster Analysis, pp. 
145-172. CRC Press, Boca Raton (2016) 
16. McLachlan, G.J., Bean, R.W., Ben-Tovim Jones, L.: Extension of the mixture of factor 
analyzers model to incorporate the multivariate t-distribution. Comput. Stat. Data Anal. 51, 
5327-5338 (2007) 
17. McNicholas, P., Murphy, T.: Parsimonious Gaussian mixture models. Stat. Comput. 18(3), 
285-296 (2008) 
18. Morlini, I.: A latent variables approach for clustering mixed binary and continuous variables 
within a Gaussian mixture model. Adv. Data Anal. Classif. 6(1), 5-28 (2012) 
19. Muthén, B.: A general structural equation model with dichotomous, ordered categorical, and 
continuous latent variable indicators. Psychometrika 49(1), 115-132 (1984) 
20. Ranalli, M., Rocci, R.: Mixture models for ordinal data: a pairwise likelihood approach. Stat. 
Comput. 26(1), 529-547 (2016) 
21. Ranalli, M., Rocci, R.: Standard and novel model selection criteria in the pairwise likelihood 
estimation of a mixture model for ordinal data. In: Wilhelm, A.F.X., Kestler, H.A. (eds.) 
Analysis of Large and Complex Data. Studies in Classification, Data Analysis and Knowledge 
Organization, pp. 53-68. Springer, Cham (2016) 
22. Ranalli, M., Rocci, R.: Mixture models for mixed-type data through a composite likelihood 
approach. Comput. Stat. Data Anal. 110(C), 87-102 (2017). https://doi.org/10.1016/j.csda.2016.12.01 
23. Ranalli, M., Rocci, R.: A model-based approach to simultaneous clustering and dimensional 
reduction of ordinal data. Psychometrika (2017). https://doi.org/10.1007/s11336-017-9578-5 
24. Tipping, M.E.: Probabilistic visualisation of high-dimensional binary data. In: Proceedings of 
the 1998 Conference on Advances in Neural Information Processing Systems II, pp. 592-598. 
MIT Press (1999) 
25. Varin, C., Reid, N., Firth, D.: An overview of composite likelihood methods. Stat. Sin. 21(1), 
1-41 (2011) 
26. Vermunt, J.K., Magidson, J.: Latent GOLD 4.0 User’s Guide. Statistical Innovations Inc., 
Belmont (2005) 
27. White, A., Wyse, J., Murphy, T.B.: Bayesian variable selection for latent class analysis using a 
collapsed Gibbs sampler (2014, preprint). arXiv:1402.6928 
28. Willse, A., Boik, R.: Identifiable finite mixtures of location models for clustering mixed-mode 
data. Stat. Comput. 9(2), 111-121 (1999) 


Part II 
Exploratory Data Analysis 


Preference Analysis of Architectural 
Facades by Multidimensional Scaling 
and Unfolding 


Giuseppe Bove, Nicole Ruta, and Stefano Mastandrea 


Abstract The methods of paired comparison and ranking play an important role 
in the analysis of preference data. In this study, we first show how asymmetric 
multidimensional scaling makes it possible to represent in a diagram the preference order that 
emerges from a paired-comparison task concerning architectural facades. A ranking 
task involving the same stimuli and the same subject sample further enriched the 
preference analysis, because multidimensional unfolding applied to the ranking data 
matrix makes it possible to detect the relationships between subjects and architectural facades. 
The results show that the high-curvature facade is the most preferred, followed by the 
medium-curvature, angular and rectilinear ones. The rectilinear stimulus was always the 
least preferred, rather than the angular one as expected. 


Keywords Preference data - Asymmetric multidimensional scaling - 
Multidimensional unfolding 


1 Introduction 


Several studies have shown that people prefer curved objects to angular 
ones and that curved polygons are more easily associated with safe and positive 
concepts and with female names than their angular counterparts, but the 
elements that drive this preference are still unclear [7]. In this study, the role 
of curvature in driving preferences is extended to the architecture domain, by 
focusing on classical architectural facades [4]. The Oratorio dei Filippini 
by Francesco Borromini (Rome, 1637-1650), a typical Baroque building with 
characteristic curved lines, was chosen as the reference building. We asked 
a professional architect to render a simplified 2D model of the selected architectural 
facade in order to make our stimuli more realistic. The facade was then 
modified by controlling the amount of global and local curvature introduced. Four 
versions of the same building were produced and used in the experiment (see Fig. 1): 
(a) high curvature: global and local curvature; (b) medium curvature: global curvature 
and local straight; (c) rectilinear: global and local straight; (d) angular: global and 
local angular. 

Preferences were collected from twenty-four volunteers recruited from the 
student population of Roma Tre University, with two different methods: a paired- 
comparison task and a ranking task. To further investigate the role of expertise, at 
the end of the tasks participants were asked to self-report on a five-point Likert scale 
their artistic education level and their art interest. 

In the following sections we will report separately the results of the paired- 
comparison and ranking tasks. Conclusions are discussed in the final section. 


Fig. 1 The four architectural facades used in the study. From (a) to (d): high curvature, medium 
curvature, rectilinear and angular 


2 Paired-Comparison Task 


All six possible pairs of the four facades (Fig. 1) were presented to all 
respondents in random order without repetitions. Stimuli were projected on a 
screen at a distance of approximately two metres. All participants viewed each pair 
for 3 s and recorded the preferred facade on a sheet of paper provided individually. 
The dominance matrix shown in Table 1 summarizes the results of the paired comparisons. 
Each entry represents the number of times the row facade was preferred to the 
column facade, and main diagonal elements are conventionally set to zero. The 
off-diagonal elements satisfy a constant-sum property (i.e., all pairs 
of corresponding entries (i, j) and (j, i) sum to 24), so that the sum of the 
row and column totals is also constant for each facade. Thanks to this way of 
representing the data, we can easily obtain the preference order of the facades from the row 
totals of the dominance matrix, that is, from the most to the least preferred: A 
(high curvature), B (medium curvature), D (angular), C (rectilinear). 

Another consequence of the previous properties is that the symmetric part of this 
matrix is not informative, so it is worthwhile to focus on the skew-symmetric 
information. In linear algebra it is known that any square matrix $\Omega = \{\omega_{ij}\}$ can 
be additively and uniquely decomposed into a symmetric part $M = \{m_{ij}\} = 
\{0.5(\omega_{ij} + \omega_{ji})\}$ and a skew-symmetric part $N = \{n_{ij}\} = \{0.5(\omega_{ij} - \omega_{ji})\}$, with 
$\Omega = M + N$. The matrix $M$ is the best symmetric least squares approximation to $\Omega$, 
the matrix $N$ describes the departures from symmetry, and $n_{ij} = -n_{ji}$ holds for each 
pair $(i, j)$. For the dominance matrix shown in Table 1, all the symmetric entries 
$m_{ij}$ ($i \neq j$) are equal to 12, and the skew-symmetric entries $n_{ij}$ are the differences between 
the corresponding frequencies in the matrix and the value 12, which in our experiment 
corresponds to the situation of equilibrium (12 subjects prefer one facade and the other 
12 subjects prefer the other one). 
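
The decomposition can be checked numerically. The sketch below, in plain Python, splits the dominance matrix of Table 1 into its symmetric and skew-symmetric parts and recovers the preference order from the row totals. Two entries are assumptions on our part: the zero on the diagonal for facade B and the count 7 for D over B, both implied by the zero-diagonal convention and the constant-sum property (24 - 17 = 7).

```python
# Dominance matrix from Table 1 (rows and columns: A, B, C, D).
# Assumed entries: omega[1][1] = 0 (diagonal) and omega[3][1] = 7 (= 24 - 17).
labels = ["A", "B", "C", "D"]
omega = [
    [0, 21, 22, 21],
    [3, 0, 19, 17],
    [2, 5, 0, 6],
    [3, 7, 18, 0],
]
k = len(labels)

# Symmetric part M and skew-symmetric part N, so that omega = M + N.
M = [[0.5 * (omega[i][j] + omega[j][i]) for j in range(k)] for i in range(k)]
N = [[0.5 * (omega[i][j] - omega[j][i]) for j in range(k)] for i in range(k)]

# Preference order from the row totals of the dominance matrix.
row_totals = {labels[i]: sum(omega[i]) for i in range(k)}
order = sorted(labels, key=lambda name: row_totals[name], reverse=True)
```

Every off-diagonal entry of `M` is 12, so all the information about preferences sits in `N`; for instance the entry of `N` for the pair (A, C) is 22 - 12 = 10.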

In his pioneering paper, Gower [3] proposed to represent the skew-symmetric 
component N by singular value decomposition. The interpretation of the diagram 
obtained from the singular vectors is not in terms of distances but in terms of areas: 
the area of the triangle that a pair of points forms with the origin is 
proportional to the size of the skew-symmetry, whose sign is given by the orientation 
in the plane. A more detailed description of the non-Euclidean geometry of this type 
of diagram can be found in [3]. The skew-symmetric component of the dominance 
matrix in Table 1 is represented in Fig. 2. The preference order A, B, D, C is 
easily detected in the diagram going from point A to point C in the counter-clockwise 
direction (skew-symmetry is positive). Facades A and C have the largest imbalance 
between each other, so the area of the triangle the two corresponding points form 
with the origin is large (A dominates C because it comes first in the counter-clockwise 
direction, see Fig. 2). Facades B and D have the smallest imbalance, so the area of 
the corresponding triangle is small; B dominates D. 


Table 1 Dominance matrix for the paired-comparison task 

Facades    A    B    C    D 
A          0   21   22   21 
B          3    0   19   17 
C          2    5    0    6 
D          3    7   18    0 


Fig. 2 Gower diagram for the skew-symmetric component of data in Table 1 
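
The area interpretation can be illustrated with a short numerical sketch. The code below uses NumPy (our choice, not the software used in the study) to take the singular value decomposition of the skew-symmetric part of Table 1 and to verify that, in the plane of the first two singular vectors, twice the area of the triangle formed by two points and the origin reproduces the rank-2 approximation of the skew-symmetries in absolute value. The D-over-B count of 7 and the zero diagonal are assumed from the constant-sum property; the scaling of the coordinates by the square root of the first singular value is one common convention.

```python
import numpy as np

labels = ["A", "B", "C", "D"]
# Dominance matrix of Table 1 (entry D over B = 7 assumed from the constant sum 24).
omega = np.array([[0, 21, 22, 21],
                  [3, 0, 19, 17],
                  [2, 5, 0, 6],
                  [3, 7, 18, 0]], dtype=float)
N = 0.5 * (omega - omega.T)            # skew-symmetric component

U, s, Vt = np.linalg.svd(N)
# Singular values of a skew-symmetric matrix come in equal pairs,
# so the first two singular vectors span one plane of the diagram.
coords = U[:, :2] * np.sqrt(s[0])      # one point per facade
N2 = (U[:, :2] * s[:2]) @ Vt[:2, :]    # best rank-2 approximation of N

x, y = coords[:, 0], coords[:, 1]
# Signed quantity x_i * y_j - x_j * y_i = 2 * (signed area of triangle O, i, j).
twice_area = np.outer(x, y) - np.outer(y, x)

# Share of the total skew-symmetry (sum of squares) shown in the plane.
fit = (s[:2] ** 2).sum() / (s ** 2).sum()
```

The sign of `twice_area` depends on the arbitrary orientation of the singular vectors returned by the routine, which is why only the orientation convention (clockwise or counter-clockwise), not the sign itself, carries the dominance direction.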

To make the diagram easier to interpret, methods based on distance models were 
also considered to represent skew-symmetry. Bove [2] proposed a method of 
asymmetric multidimensional scaling that adapts the idea originally proposed by 
Okada and Imaizumi [6] for asymmetric proximities to skew-symmetric data. The 
graphical representation is obtained in the following two steps. 


Step 1 





In the first step, the sizes of the skew-symmetries, given by the absolute values $|n_{ij}|$, 
are represented by distances in a low-dimensional Euclidean space (usually two- 
dimensional) according to the following model: 

$$f(|n_{ij}|) = d_{ij} + \varepsilon_{ij} = \sqrt{\sum_{s} (x_{is} - x_{js})^2} + \varepsilon_{ij} \qquad (1)$$

where f is a chosen data transformation function (e.g., interval, ratio, ordinal 
transformations); $d_{ij}$ is the distance between facade i and facade j ($d_{ij} = d_{ji}$); 
$x_{is}$ and $x_{js}$ are the coordinates on dimension s, respectively, of facade i and facade 
j; and $\varepsilon_{ij}$ is a residual term. The model can be easily estimated by standard statistical 
software containing symmetric multidimensional scaling routines. An advantage of 
this model is that it is easy to incorporate both non-metric approaches and external 
information regarding the objects. 
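
As a rough illustration of step 1, the sketch below embeds the sizes of the skew-symmetries of Table 1 in two dimensions. It uses classical (Torgerson) scaling written with NumPy as a simple stand-in for a symmetric multidimensional scaling routine; it is not the stress-minimizing procedure of a program such as PROXSCAL, so the configuration will generally differ from the one in the study. The D-over-B count of 7 is assumed from the constant-sum property.

```python
import numpy as np

# Dominance matrix of Table 1 (entry D over B = 7 assumed from the constant sum 24).
omega = np.array([[0, 21, 22, 21],
                  [3, 0, 19, 17],
                  [2, 5, 0, 6],
                  [3, 7, 18, 0]], dtype=float)
D = np.abs(0.5 * (omega - omega.T))      # sizes |n_ij|, used as dissimilarities
k = D.shape[0]

# Classical scaling: double-centre the squared dissimilarities ...
J = np.eye(k) - np.ones((k, k)) / k
B = -0.5 * J @ (D ** 2) @ J
# ... and keep the two leading eigenvectors, scaled by sqrt(eigenvalue).
eigval, eigvec = np.linalg.eigh(B)       # eigenvalues in ascending order
top = np.argsort(eigval)[::-1][:2]
X = eigvec[:, top] * np.sqrt(np.clip(eigval[top], 0.0, None))

# Fitted distances d_ij between facade points, to compare with |n_ij|.
fitted = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
```

Classical scaling corresponds to a ratio-type treatment of the dissimilarities; interval or ordinal transformations f would require an iterative stress-based routine instead.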


Step 2 


In the second step, the signs of skew-symmetry (and so the preference order) 
are derived from the comparison of circles drawn around the points of the 
configuration obtained from step 1. The radii $r_i$ of the circles are estimated by the 
following model: 

$$\gamma_{ij} = (r_i - r_j) + \varepsilon_{ij} \qquad (2)$$

where $\gamma_{ij} = 1$ if $n_{ij}$ is positive and $\gamma_{ij} = -1$ if $n_{ij}$ is negative. As a result, when 
the circle around point i is larger than the circle around point j, the estimate of $\gamma_{ij}$ 
is positive and the estimate of $\gamma_{ji}$ is negative (consequently, the estimate of skew- 
symmetry $n_{ij}$ is positive and the estimate of skew-symmetry $n_{ji}$ is negative). A least 
squares solution for the $r_i$'s is 

$$\hat{r}_i = \frac{1}{n} \sum_{j=1}^{n} \gamma_{ij} \qquad (3)$$

with $\sum_{i=1}^{n} \hat{r}_i = 0$, the matrix $\Gamma = (\gamma_{ij})$ being skew-symmetric. Any translation $\hat{r}_i + c$ 
by a constant c is equivalent to the initial solution. However, it is convenient to 
choose only among solutions with no negative $\hat{r}_i$'s, because they represent radii. 
In this application, we chose the unique solution having $\min(\hat{r}_i) = 0$. 
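
The radii of step 2 follow directly from the signs of the skew-symmetries. The plain Python sketch below applies the least squares solution (3) to the data of Table 1 and then shifts the radii so that the smallest one is zero. The D-over-B count of 7 is assumed from the constant-sum property, and coding a tie as 0 is our own convention (no ties occur in these data).

```python
# Dominance matrix of Table 1 (entry D over B = 7 assumed from the constant sum 24).
labels = ["A", "B", "C", "D"]
omega = [
    [0, 21, 22, 21],
    [3, 0, 19, 17],
    [2, 5, 0, 6],
    [3, 7, 18, 0],
]
k = len(labels)


def sign(value):
    """Return 1, -1 or 0 according to the sign of value."""
    return (value > 0) - (value < 0)


# gamma_ij is the sign of the skew-symmetry n_ij = 0.5 * (omega_ij - omega_ji).
gamma = [[sign(omega[i][j] - omega[j][i]) for j in range(k)] for i in range(k)]

# Least squares solution (3): row means of gamma, which sum to zero.
r_hat = [sum(row) / k for row in gamma]

# Translate so that the smallest radius is zero.
shift = min(r_hat)
radii = {labels[i]: r_hat[i] - shift for i in range(k)}
```

With these data the radii decrease in the order A, B, D, C, and facade C receives a radius of zero, in agreement with the preference order obtained from the row totals.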

In our application, step 1 was performed by the PROXSCAL program with the 
ratio transformation option for $|n_{ij}|$, and the radii in step 2 were computed by a Matlab 
routine. The method represents the architectural facades as points in a two- 
dimensional diagram (Stress-I = 0.11). Both the facade preference order and the 
imbalances are represented: the former by circles with different radii (larger circles 
correspond to higher ranks of preference), the latter by the distances between points 
(larger distances correspond to larger imbalances). The results are shown in Fig. 3. 

The overall preference order (A, B, D, C) is represented by the size of the circles. 
Facade A is the most preferred and is preferred to a similar extent over B, C and D. Facades 
B and D have the smallest imbalance between each other, so they are represented 
close together on the plane. Facade C is represented with a zero radius, so it is dominated by all the 
other facades, and much more so by A and B, which are positioned further away from it. 
Fig. 3 Asymmetric multidimensional scaling representation for data in Table 1 


Table 2 Order choices in the ranking task 


3 Ranking Task 


In this task, we showed the four facades at the same time on a projection screen, 
assigning to each of them a letter placed at the bottom of the picture. 
Participants had a grid printed on a sheet of paper. The grid consisted of four boxes 
numbered in increasing order from 1 to 4. Participants had to write the letter of each 
facade in the corresponding row, according to their preferences. We asked them to rank the 
facades from the most (= 1) to the least (= 4) preferred. Table 2 reports the number 
of times each facade (row) was placed in each rank position (column) by participants. 
The highest frequency in each row of the table confirms the preference order 
found in the paired-comparison task: A, B, D, C. 

Besides, we analysed the (24 × 4) ranking data matrix $P = (p_{ij})$ with the multidi- 
mensional unfolding technique (e.g., [1], Chaps. 14-16) to represent the relationships 
between subjects and facades. The unfolding model can be expressed in scalar 
notation as 

$$f(p_{ij}) = d_{ij} + \varepsilon_{ij} = \sqrt{\sum_{s} (x_{is} - y_{js})^2} + \varepsilon_{ij} \qquad (4)$$

where, as before, f is a chosen data transformation function; $d_{ij}$ is the distance 
between subject i and facade j (so it can be $d_{ij} \neq d_{ji}$); $x_{is}$ and $y_{js}$ are the 
coordinates on dimension s, respectively, of subject i and facade j; and $\varepsilon_{ij}$ is a 
residual term. A diagram of the relationships is obtained from the coordinates 
$x_{is}$ and $y_{js}$, so that the distances $d_{ij}$ from subject points i to facade points j 
correspond to the rank order scores $p_{ij}$, with high rank order scores corresponding to 
small distances. Moreover, this method can be easily applied by standard statistical 
software containing multidimensional unfolding routines. The results obtained with 
the PREFSCAL program are shown in Fig. 4, where numbers represent the subjects and 
letters represent the facades (Stress-I = 0.07). Labels attached to each subject number 
represent artistic education levels. According to the unfolding model properties, the 
subjects tend to be closer to the facades to which they assigned a higher rank 
in the task. Overall, facade A (high curvature) and, to a lesser extent, facade B 
(medium curvature) are the two stimuli around which the majority of the subjects 
are placed; these are also the subjects with the higher artistic education. Subjects 4, 9 and 15 
preferred facade D, but their artistic education is at a medium-low level. 
Only one subject (subject 14) preferred facade C, but she has the lowest level of 
artistic education, corresponding to no artistic education at all. 


Fig. 4 Multidimensional unfolding representation for rank order scores (subject artistic education 
level labels: none, little, enough, much) 


4 Conclusions 


In this article, we showed that graphical representations obtained by multidi- 
mensional scaling and unfolding make it easy to detect the preference order, the size of 
asymmetry and the relationships between subjects and stimuli. Our experiments 
confirmed that curvature influences preferences also for architectural stimuli. 
The high curvature facade was the most preferred, followed by the medium, 
angular and rectilinear ones. The rectilinear stimulus was always the least preferred, 
rather than the sharp one as expected. This result provides an important insight to 
better understand human preferences, suggesting that the curvature effect can be 
modulated by controlling for the level of sharpness of the stimuli it is compared 
with. In line with previous research showing that expertise plays an important 
role in influencing aesthetic judgments for sharp stimuli, our study showed that 
participants with relatively poor art training preferred rectilinear facades, while 
people with higher levels of artistic training preferred the curved ones. Due to 
the non-probabilistic features of our small sample, we followed a ‘data-analytic’ 
approach emphasizing the graphical display of data. Future developments of this 
research will consider the selection of large probabilistic samples in order to confirm 
our hypothesis and generalize our results in a more formalized context (e.g., [5, 8]). 


Acknowledgements We thank the architect Stefania Lamaddalena for sharing with us her 
this study. 


Community Structure in Co-authorship 
Networks: The Case of Italian 
Statisticians 


Domenico De Stefano, Maria Prosperina Vitale, and Susanna Zaccarin 


Abstract Community detection is a very appealing topic in network analysis. A 
precise definition of community is still lacking, so the comparison of different 
methods is not a simple task. This paper shows exploratory results by adopting two 
well-known community detection methods and a new proposal to discover groups 
of scientists in the co-authorship network of Italian academic statisticians. 


Keywords Co-authorship networks - Community detection algorithms - 
Modularity - Italian statisticians 


1 Introduction 


In the last decades social network analysis (SNA) has become a widespread 
methodological approach to study scientific collaboration. As stated in several 
studies [5, 8], scientific collaboration is a crucial factor to enhance publication 
productivity and research quality. The role of scientific collaboration allowing a 
fertile ground for the development of new ideas is also recognized in research 
funding European programmes as well as national projects. 

Thanks to the availability of international bibliographic archives, co-authorship 
networks (in which the connection between two researchers is given by the number 
of papers they co-authored) are used as a proxy of scholars’ collaborative behavior 
in science [2]. Usually, binary networks (obtained by setting connections greater than zero 
to one) are considered in empirical analysis. A common aim in co-authorship 
studies from an SNA perspective is to understand network properties, since 
the evolution of topics and methods in scientific fields appears strongly related to the 
topological structure of the collaboration patterns among scholars. In this stream of 
research, the recovery of communities (the term used to identify groups or clusters 
of actors in a graph) shaping the network structure is very appealing and 
informative. Unfortunately, a precise definition of what constitutes a community 
(broadly, a part of a network where internal links are denser than external ones) is still 
lacking [16]. As a consequence of this conceptual vagueness, several community 
detection algorithms have been proposed in the literature [9]. 

Starting from previous findings on small-world topology in the co-authorship 
network of Italian academic statisticians [6, 11], the present contribution intends 
to deepen the analysis of this case study by uncovering a meaningful community 
structure for Italian scholars. To this aim, results from three community detection 
methods, the Girvan-Newman algorithm [13], the Louvain algorithm [3], and a new 
method, the modal clustering algorithm [12], will be compared. The evaluation of 
performance measures [16] and the interpretation of the main results should benefit 
from the common clustering perspective shared by the three algorithms. 

This paper is organized as follows: Section 2 reviews the main characteristics 
of the three methods and their performance in identifying communities within an 
illustrative example. Section 3 discusses the main results obtained by adopting the 
aforementioned methods on the co-authorship network of Italian statisticians using 
also available scholar’s attributes (i.e., scientific field and university affiliation). 
Section 4 reports new lines of research for future work. 


2 Community Detection Methods 


Similarly to the problem of clustering for attribute data, the lack of a unique defi- 
nition of community in the presence of network data has led to the proliferation of 
several methods in different theoretical contexts. Among them, some are explicitly 
designed to handle these kinds of data. For instance, blockmodeling [7, pp. 11-12] 
is a methodological approach “to identify, in a given network, clusters of actors 
that share structural characteristics in terms of some relations,” mainly based on 
partitioning the relational matrix into a set of blocks. 

Recently, a huge variety of network-based clustering techniques, the so-called 
community detection methods, have been developed based on hierarchical cluster- 
ing techniques [13], locating network communities by statistical analysis of the raw 
data [14], or optimizing different quality functions [9]. 
In the following, we focus on two well-known community detection algorithms, 
and a new method based on an adaptation to network data of the modal clustering 
procedure (for an overview with standard data, see [1]): 


1. the Girvan-Newman algorithm [13], one of the most popular community detec- 
tion approaches. It is based on a hierarchical divisive procedure in which links are 
iteratively removed based on the value of the edge betweenness. The procedure 
of link removal ends when the value of the modularity index Q is maximized. 
This index [4, 13] measures the fraction of the edges in the network that connect 
nodes within the same community minus its expected value in the case of a network with 
edges placed at random. It takes the value 0 when the number 
of within-community edges is no better than in the randomized network, and 
approaches its maximum value of 1 in the presence of strong community structure. The index 
usually falls in the range 0.3-0.7, and a value of around 0.3 is a good indicator 
of significant community structure in the network; 

2. the Louvain algorithm [3], also based on the modularity index and on a 
hierarchical approach. Initially, each node is assigned to a community on its own. 
In every step, nodes are re-assigned to communities in a local, greedy way: each 
node is moved to the community in which it achieves the highest contribution to 
the modularity; 

3. the modal clustering algorithm [12], which starts from the idea that highly 
connected sets of nodes can be detected around the modes of a “density” function 
f reflecting the cohesiveness between nodes, e.g., centrality measures [10] like 
the node degree (i.e., the number of links a node has with the other nodes in 
the network) or the actor betweenness (i.e., the number of shortest paths 
between two other nodes that pass through a specific node). The modes of f 
are seen as the archetypes of the clusters, which are in turn represented by their 
surrounding regions. Any section of f, at a level λ, identifies a level set, namely 
the region with f value above λ. The key idea is that when f is unimodal, there 
is no clustering structure, and the level set is connected for any choice of λ. 
Conversely, when f is multimodal, the identified level set may be connected or 
not, depending on the value of λ. In particular, nodes are clustered together when they 
have a value of f above the examined threshold λ and they are connected in the 
underlying network. Clustering is performed around the modal actors, namely 
actors showing the largest value of the chosen function. Furthermore, by varying 
the level λ the method gives rise to a tree diagram, called cluster tree (which is 
graphically similar to a dendrogram), where each leaf corresponds to a mode of 
the function. 
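
Since the first two algorithms are driven by the modularity index Q, a minimal sketch of its computation may help. The plain Python function below evaluates Q for a given partition of an undirected, unweighted graph as the sum over communities of l_c/m - (d_c/(2m))^2, where m is the number of edges, l_c the number of edges inside community c and d_c the sum of the degrees of its nodes. The toy graph (two triangles joined by one edge) and the function name are hypothetical, not taken from the paper.

```python
def modularity(edges, communities):
    """Modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    member = {node: c for c, nodes in enumerate(communities) for node in nodes}
    internal = [0] * len(communities)      # l_c: edges inside community c
    degree_sum = [0] * len(communities)    # d_c: total degree of community c
    for u, v in edges:
        degree_sum[member[u]] += 1
        degree_sum[member[v]] += 1
        if member[u] == member[v]:
            internal[member[u]] += 1
    return sum(l / m - (d / (2 * m)) ** 2 for l, d in zip(internal, degree_sum))


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_two = modularity(edges, [{0, 1, 2}, {3, 4, 5}])    # the two triangles
q_one = modularity(edges, [{0, 1, 2, 3, 4, 5}])      # everything in one group
```

Here `q_two` equals 5/14 (about 0.36), while placing all nodes in one community gives `q_one` equal to 0, so an algorithm that maximizes Q prefers the split into the two triangles.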


The first two algorithms are particularly suited for undirected and unweighted 
relational data (as in the most common case of co-authorship data obtained by disre- 
garding the number of papers co-authored by pairs of scholars), while the third one 
is more flexible since different concepts of cohesiveness among actors can be used. 

To compare the three approaches in discovering communities, we consider 
Zachary’s karate club network data [17], describing the friendship relationships 
among 34 members of a karate club at a US university in the 1970s. A useful 
feature of this dataset is that, during the period of observation, the club split into 
two factions, due to a dispute between the administrator and the karate instructor. 
Thus, a true cluster membership of the actors in the network is known and can be 
used as a benchmark to evaluate the performance of different methods. Figure 1 
shows the communities identified by using the three algorithms. It can be seen 
that the modal clustering method, using node degree as density function 
to reflect actors’ cohesiveness, is able to detect the two factions underlying the 
network. In particular, the method works by clustering all actors around the 
modal actors (the two most central ones in terms of their degree in Fig. 1c), 
who, incidentally, are the members around whom the karate club split into two 
distinct factions. The other approaches detect different partitions, in 
particular consisting of four groups. 


Fig. 1 Comparison of the three community detection methods for Zachary’s karate club network 
data: (a) Girvan-Newman algorithm; (b) Louvain algorithm; (c) modal clustering algorithm 
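
The level-set idea behind modal clustering with node degree as density function can be sketched in a few lines. The plain Python code below is only an illustration of that idea on a hypothetical toy graph (two hubs linked through two of their neighbours); it is not the algorithm of [12], and the function name is our own. For each threshold, it keeps the nodes whose degree exceeds the threshold and finds the connected components among them.

```python
def level_set_components(adjacency, f, level):
    """Connected components of the subgraph induced by nodes with f > level."""
    keep = {v for v in adjacency if f[v] > level}
    seen, components = set(), []
    for start in sorted(keep):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:                      # depth-first search inside the level set
            v = stack.pop()
            component.append(v)
            for w in adjacency[v]:
                if w in keep and w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(sorted(component))
    return components


# Hypothetical toy graph: hubs 0 and 4, linked through their neighbours 3 and 5.
edges = [(0, 1), (0, 2), (0, 3), (4, 5), (4, 6), (4, 7), (3, 5)]
adjacency = {v: set() for v in range(8)}
for u, v in edges:
    adjacency[u].add(v)
    adjacency[v].add(u)

degree = {v: len(neighbours) for v, neighbours in adjacency.items()}
tree = {level: level_set_components(adjacency, degree, level) for level in (0, 1, 2)}
```

At the low thresholds the level set is connected, while at threshold 2 only the two hubs remain and they fall into separate components: these are the two modes, around which the remaining nodes would then be clustered.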


3 Community Detection Results for Italian Statisticians 


The three aforementioned community detection methods are used to analyze the 
co-authorship network defined for the population of the 792 Italian academic 
statisticians belonging to five scientific subfields, as recorded in the Italian Ministry 
of University and Research (MIUR) database as of March 2010. To collect publications, 
three bibliographic archives are considered [6]: two international (Web of Science and 
Current Index to Statistics) and one national, based on publications attached to the 
nationally funded grants (PRIN projects). Hence the co-authorship net- 
work under analysis is the result of combining multiple data sources [11]. 


'The five subfields established by the Italian governmental official classification are: Method- 
ological Statistics, Statistics for Experimental and Technological Research, Economic Statistics, 
Demography, and Social Statistics. 
The general aim of the community detection procedures adopted here is to
discover whether the co-authorship network of Italian statisticians can be clustered
into communities. To make the results comparable, the three community detection
methods are applied to the largest connected component of the given graph (i.e.,
the giant component). This approach, used in the related literature to
isolate disjoint components [15], is useful in our case given that only the modal
clustering algorithm is able to handle disconnected graphs. In the observed
co-authorship network, the giant component consists of 660 authors, representing
82% of the statisticians. Therefore the analysis can be restricted to this set of authors
without loss of generality. In performing the modal clustering method, two different
density functions (degree and betweenness) are chosen.
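
As an illustration of the restriction to the giant component, the following is a
minimal sketch that extracts the largest connected component of an undirected graph
by breadth-first search. The edge list and the function name are hypothetical; they
are not taken from the study, and authors without any co-authorship link would
simply not appear in such an edge list.

```python
from collections import deque


def largest_connected_component(edges):
    """Return the node set of the largest connected component (undirected graph)."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Hypothetical toy co-authorship links: one group of four authors and one pair.
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("x", "y")]
giant = largest_connected_component(toy_edges)
share = len(giant) / len({node for edge in toy_edges for node in edge})
```

Here `giant` contains the four connected authors and `share` is the fraction of
linked authors that fall in the giant component.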

The main results of the three procedures are reported in Table 1. In general,
the methods are quite comparable in terms of the number of detected communities
and of their sizes. The Girvan-Newman algorithm produces the largest number of
communities (22). The quality of the partitions, measured by the modularity
index Q, is also quite similar across methods. The lowest value is associated with
modal clustering with betweenness as the density function, which is also the method
that gives rise to communities of relatively larger size than the other two methods.
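
The modularity index is not defined in this section; assuming the standard
Newman-Girvan definition for an unweighted, undirected graph with m edges,
Q = sum over communities c of [L_c / m - (d_c / (2m))^2], where L_c is the number
of within-community edges and d_c is the total degree of the nodes in c. A minimal
sketch on a hypothetical toy graph (two triangles joined by one edge):

```python
def modularity(edges, communities):
    """Newman-Girvan modularity of a partition of an unweighted, undirected graph."""
    m = len(edges)
    label = {node: idx for idx, group in enumerate(communities) for node in group}
    internal = [0] * len(communities)    # L_c: edges with both ends in community c
    degree_sum = [0] * len(communities)  # d_c: total degree of the nodes in c
    for u, v in edges:
        degree_sum[label[u]] += 1
        degree_sum[label[v]] += 1
        if label[u] == label[v]:
            internal[label[u]] += 1
    return sum(l_c / m - (d_c / (2 * m)) ** 2
               for l_c, d_c in zip(internal, degree_sum))


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_split = modularity(toy_edges, [{0, 1, 2}, {3, 4, 5}])
q_single = modularity(toy_edges, [{0, 1, 2, 3, 4, 5}])
```

The two-triangle partition scores higher than the trivial one-community partition,
which always has Q = 0 under this definition.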

The modal clustering (with degree as the density function) and the Louvain
algorithm show the highest, and similar, values of the modularity index as well as
the same total number of detected communities (18). In the following, the
composition of the nine largest communities identified by these two approaches is
analyzed. These communities are quite representative since, for both methods,
they comprise about 70% of the 660 statisticians in the giant component.

Table 2 reports some descriptive measures of the 9 communities listed in 
descending order by size. In both algorithms, the detected communities share 
quite similar structural characteristics. By way of example, the largest community 
(C1) comprises 91 and 69 statisticians, for the modal clustering and the Louvain 
algorithm, respectively. 

The author average degree, computed within the community, is usually
comparable across methods, ranging from a minimum of 1.75 (community C4 by modal
clustering) to a maximum of 4.04 (community C3 for the Louvain algorithm). The
ratio between within-community links (edges representing the relationship in the
same community) and the external links (edges activated with nonmembers of the


Table 1 Performance measures for the giant component of the Italian statisticians
co-authorship network, by method

Method                            C     Average (St. Dev.)    Q
Girvan-Newman                     22    30.000 (15.754)       0.752
Louvain                           18    36.667 (17.283)       0.762
Modal clustering (betweenness)    13    50.769 (30.444)       0.702
Modal clustering (degree)         18    36.667 (23.118)       0.761

C = number of detected communities, Average = average number of authors in
communities (St. Dev.), Q = modularity index

Table 2 Descriptive measures of the first nine detected communities obtained by the modal
clustering (MC) and the Louvain algorithms for the giant component of the Italian statisticians
co-authorship network

             Size                     Average degree author     Intra-extra links ratio
Community    MC (degree)   Louvain    MC (degree)   Louvain     MC (degree)   Louvain
C1           91            69         3.52          2.92        0.120         0.056
C2           67            65         2.48          3.75        0.063         0.092
C3           57            53         2.60          4.04        0.057         0.022
C4           49            49         1.75          4.00        0.011         0.021
C5           49            49         2.00          2.98        0.016         0.021
C6           48            48         2.00          3.29        0.039         0.026
C7           48            44         2.17          3.36        0.018         0.076
C8           47            41         3.23          3.61        0.038         0.056
C9           44            40         2.32          3.35        0.036         0.050


community) is quite small for both methods. Looking at the internal composition 
by scientific subfield and university affiliation, in the Louvain method, the largest 
community includes several authors in the statistics subfield. 

In the modal clustering, the largest community is composed mostly
of authors in the statistics subfield and some authors in the economic statistics
subfield, mainly clustered according to the geographic proximity of their universities.
In particular, the majority of authors in this cluster are affiliated with universities
located in the North and in the Center of Italy (e.g., Florence, Padua, Rome,
and Milan). The same differences arise when looking at the composition of the other
larger detected communities. Both methods find clusters that are homogeneous
by scientific sector (demographers and social statisticians, on the one hand, and
methodological statisticians, on the other hand, tend to create strong communities),
although it seems that modal clustering groups together authors on the basis of
links mainly driven by the geographic proximity of the universities with which they
are affiliated, while the Louvain algorithm aggregates authors on the basis of network
characteristics.

Generally speaking, comparing all possible pairs of communities, the overlap
among the detected communities is low. The average Jaccard index is indeed
equal to 0.02. Only some communities show a degree of overlap, with about
30% of common members, as shown in the example in Fig. 2 for community
1 (C1) in the Louvain algorithm and community 4 (C4) in the modal clustering
algorithm. These methods are therefore able to capture common relational aspects
of the observed co-authorship network, enriching the interpretation of the findings
related to the authors' attributes.
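
The overlap measure can be sketched as follows. This is only an illustration on
hypothetical toy partitions, assuming that the average is taken over all pairs made
of one community from each method; the text does not spell out the exact averaging
scheme.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two sets of actors."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_jaccard(partition_1, partition_2):
    """Average Jaccard index over all cross-method pairs of communities."""
    scores = [jaccard(c1, c2) for c1 in partition_1 for c2 in partition_2]
    return sum(scores) / len(scores)


# Hypothetical partitions of six actors produced by two methods.
method_a = [{1, 2, 3, 4}, {5, 6}]
method_b = [{1, 2}, {3, 4, 5, 6}]
average_overlap = mean_pairwise_jaccard(method_a, method_b)
```

With many communities per method, most cross-method pairs share no members, which
is why an average value can be very low even when a few pairs overlap substantially.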

Fig. 2 Representation of the communities with the largest number of overlapping actors: (a)
community 1 (C1) Louvain algorithm; (b) community 4 (C4) modal clustering algorithm. The
names of the statisticians common to both communities are displayed


4 Conclusions 


The general aim of the community detection procedures adopted here was to
discover whether the co-authorship network of Italian statisticians can be clustered
into communities. To this purpose, results from three different community detection
methods (the Girvan-Newman algorithm, the Louvain algorithm, and the modal
clustering algorithm) have been compared by presenting performance measures and
interpretations of the internal composition of specific communities. The most
suitable methods in terms of quality of the partitions discovered are the modal
clustering algorithm and the Louvain algorithm.

As general evidence, it seems that the co-authorship network of the Italian 
statisticians is clustered in a relatively small number of communities with different 
internal composition that is mainly determined by authors’ scientific field and 
university affiliation. 

In order to find denser communities, it would be important to also consider in the
analysis the strength of the collaboration relationship, by using the number of
co-authored papers between pairs of authors. As a future line of research, we
intend to extend the described community detection methods to weighted networks.
It would also be interesting to explore community structures in the
presence of multiplex networks, in which collaboration is described by also measuring
other kinds of relationships among scientists (e.g., co-participation in funded
projects).

Analyzing Consumers' Behavior
in Brand Switching


Akinori Okada and Hiroyuki Tsurumi 


Abstract Asymmetric multidimensional scaling is extended to represent differences
among consumers in brand switching. The asymmetric multidimensional
scaling, based on the singular value decomposition, represents asymmetric
relationships among brands in the brand switching by introducing the outward tendency,
which corresponds to the left singular vector, and the inward tendency, which
corresponds to the right singular vector. The resulting configuration is represented
in a plane spanned by the left and the right singular vectors, where each brand is
represented as a point. Each dimension (component) has its own plane or
two-dimensional configuration. The asymmetric multidimensional scaling is extended
so that each consumer is represented as a point in the plane. The joint configuration
of brands and consumers represents how each consumer or a group of consumers
relates to brands in the brand switching. The procedure is applied successfully to
brand switching data among potato snacks.


Keywords Asymmetry · Brand switching · Consumer · Individual differences ·
Multidimensional scaling


1 Introduction 


Brand switching is derived by comparing the brands purchased by a consumer in two
consecutive periods. Asymmetric multidimensional scaling has been used to
analyze brand switching [5, 6]. While these studies represent asymmetric
relationships of brand switching, they cannot analyze the differences among consumers
nor their relationships with brands in the brand switching. It is important to know
differences among consumers [3] and to disclose relationships between consumers
and their brand switching [1]. The present study extends the asymmetric
multidimensional scaling so that differences among consumers can be represented.
In particular, our approach also makes it possible to visualize how each consumer
or a group of consumers relates to brands in the brand switching.


2 Method 


The asymmetric multidimensional scaling based on the singular value decomposition
[2] is briefly described below. Let $A$ be the $n \times n$ asymmetric
brand switching matrix, where $n$ is the number of brands. The $(j, k)$ element of
$A$ represents the frequency of the brand switching from brand $j$ to brand $k$, where
brand $j$ is purchased at the first period, and brand $k$ is purchased at the second
period. By the singular value decomposition, $A$ is approximated by using $r$
dimensions (components) as

$$A \approx X D Y',$$

where $D$ is the $r \times r$ diagonal matrix with the $r$ largest singular values
$(d_1, \ldots, d_r)$ in descending order as its diagonal elements, $X$ is the $n \times r$
matrix of corresponding left singular vectors (each of unit length), and $Y$ is the
$n \times r$ matrix of corresponding right singular vectors (each of unit length). The
$j$th element of the $i$th column of $X$ represents the outward tendency of brand $j$
along Dimension $i$, and the $k$th element of the $i$th column of $Y$ represents the
inward tendency of brand $k$ along Dimension $i$.

Let $S$ be an $N \times n$ matrix where each row has one element equal to 1 and
$(n - 1)$ elements equal to 0, where $N$ is the number of consumers. If the $(m, j)$
element of $S$ is 1, consumer $m$ purchased brand $j$ at the first period. Similarly,
$T$ is an $N \times n$ matrix where each row has one element equal to 1 and $(n - 1)$
elements equal to 0. If the $(m, k)$ element of $T$ is 1, consumer $m$ purchased
brand $k$ at the second period. $A$ can be derived by

$$A = S'T.$$

Thus $S'T = A \approx X D Y'$. We define $F = SX$ and $G = TY$, where $F$ and $G$
are $N \times r$ matrices. The $m$th row of $F$ represents the outward tendency of
consumer $m$ along the $r$ dimensions. The $m$th row of $G$ represents the inward
tendency of consumer $m$ along the $r$ dimensions. By deriving the outward and inward
tendencies of a consumer or a group of consumers in the planar configuration of brands,
relationships of a consumer or a group of consumers with brands in the brand
switching are shown.
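
The construction above can be sketched numerically as follows. This is a minimal
illustration, assuming NumPy is available; the purchase records, the grouping
variable `heavy`, and all variable names are hypothetical and are not the data
analyzed in this chapter. Because the sign of each pair of singular vectors is
arbitrary, the sketch flips each pair so that the column of X has a nonnegative sum,
which is only one possible convention.

```python
import numpy as np

# Hypothetical toy data: brand index (0..n-1) bought by N = 8 consumers in each period.
first = np.array([0, 0, 0, 1, 1, 2, 2, 2])
second = np.array([0, 0, 1, 1, 0, 2, 2, 0])
n_brands, r = 3, 2

S = np.eye(n_brands)[first]    # N x n indicator matrix for period 1
T = np.eye(n_brands)[second]   # N x n indicator matrix for period 2
A = S.T @ T                    # n x n brand switching matrix, A = S'T

U, d, Vt = np.linalg.svd(A)    # A = U diag(d) Vt, singular values in descending order

# Fix the arbitrary sign of each singular-vector pair (X D Y' is unchanged).
signs = np.where(U[:, :r].sum(axis=0) < 0, -1.0, 1.0)
X = U[:, :r] * signs           # outward tendencies of brands
Y = Vt[:r].T * signs           # inward tendencies of brands
D = np.diag(d[:r])

F = S @ X                      # outward tendencies of consumers (N x r)
G = T @ Y                      # inward tendencies of consumers (N x r)

# Hypothetical grouping of consumers; group means give one point per group and dimension.
heavy = np.array([True, True, True, False, False, False, False, False])
group_means = {
    "Group 1": (F[heavy].mean(axis=0), G[heavy].mean(axis=0)),
    "Group 2": (F[~heavy].mean(axis=0), G[~heavy].mean(axis=0)),
}
```

Each row of `F` is simply the row of `X` for the brand that the consumer bought in
the first period, and each row of `G` is the row of `Y` for the brand bought in the
second period, so a group mean is an average of brand coordinates weighted by the
group's purchases.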

Table 1 Brand switching matrix (rows: brand at period 1; columns: brand at period 2)

Period 1     A     B     C     D     E     F     G     O     Z
Brand A    140    15     8     3     7     0     3    15     5
Brand B     15   129    12     8    16     0    12    24    12
Brand C      5    16    27     4     9     0     5    91     9
Brand D      2     7     9    32     5     0     1     4     3
Brand E      8    19     5     5    45     0     7    17    14
Brand F      2     9     4     2     1     0     0           1
Brand G      1     2     1     2     1     0     9     3     3
Brand O      5    10     7     6     4     0     3     7     4
Brand Z     10    10     6     9    21     0     4    15    22

3 A Real Dataset: Potato Snack Brands and Their Customers 


The brand switching data were collected over two periods in 2009 (period 1: June 2 to
July 31; period 2: August 1 to August 31). The frequency of the brand switching in
Table 1 is the number of consumers who changed their largest purchase brand from
period 1 to period 2. The brand switching matrix is derived from the purchase records
of 882 customers who purchased potato snacks in both periods 1 and 2. Table 1 shows
the brand switching matrix among nine potato snack brands (A, B, C, D, E, F, G, O,
Z). Brand O represents brands other than A, ..., G, and Z. The elements of the sixth
column of Table 1 are null, because brand F was withdrawn at period 2. The (j, k)
element of Table 1 is the number of consumers whose largest purchase brand was j
at period 1 and k at period 2. Further details are given in [6].


4 Data Analysis and Obtained Results 


The 9 × 9 brand switching matrix shown in Table 1 is asymmetric, and was analyzed
by the asymmetric multidimensional scaling. The five largest singular values are
162.1, 122.1, 54.5, 32.6, and 22.5. The three-dimensional result was chosen as the
solution, as in the earlier study [6]. Each dimension (component) has its own
planar configuration: Dimension i has a planar configuration spanned by the left
singular vector (abscissa) and the right singular vector (ordinate) corresponding to
the ith largest singular value (see Figs. 1, 2, and 3). The abscissa represents the
outward tendency, which expresses the weakness of a brand, or how easily consumers
switch from the brand to the other brands, and the ordinate represents the inward
tendency, which expresses the strength of a brand, or how easily consumers switch to
the brand from the other brands. The outward and inward tendencies of each consumer
are derived by F = SX and G = TY. Then the 882 consumers were classified into two
groups according to the amount of money they spent on potato snacks: Group 1 consists of


Fig. 1 Joint configurations of brands and groups along Dimension 1 (abscissa: outward
tendency; ordinate: inward tendency; Groups 1 and 2 shown with distinct markers)

Fig. 2 Joint configurations of brands and groups along Dimension 2

Fig. 3 Joint configurations of brands and groups along Dimension 3


Table 2 Mean outward and inward tendencies for Groups 1 and 2 of consumers: Group 1 (larger
than or equal to the average) and Group 2 (smaller than the average)

         Dimension 1          Dimension 2          Dimension 3
Group    Outward   Inward     Outward   Inward     Outward   Inward
1        0.477     0.493      −0.158    −0.120     0.026     −0.013
2        0.353     0.333      0.148     0.130      0.122     0.136


270 consumers whose spending is larger than or equal to the average amount
of money, and Group 2 consists of 612 consumers whose spending is
smaller than the average. The means of outward tendency and inward tendency for
Groups 1 and 2 along Dimensions 1, 2, and 3 are shown in Table 2.

The mean outward tendency for a group represents the mean of the outward
tendencies of the brands from which consumers in the group switched to the other
brands along each dimension. The mean inward tendency for a group represents
the mean of the inward tendencies of the brands to which consumers in the group
switched from the other brands along each dimension. Each of the two groups can
thus be represented as a point in the configuration of brands along each dimension.
Along Dimension 1, Group 1 has a larger mean outward tendency than Group 2,
and this is also true for the mean inward tendency. Along Dimension 1, the mean
outward tendency is smaller than the mean inward tendency for Group 1, while
the mean outward tendency is larger than the mean inward tendency for Group 2.
Along Dimension 2, Group 1 has negative mean outward and inward tendencies,
suggesting that the point representing Group 1 is in the third quadrant (Q3) in the
configuration along Dimension 2. The point representing Group 2 is in the first
quadrant (Q1) in the configuration, because Group 2 has positive mean outward and
inward tendencies. Similarly, along Dimension 3, the point representing Group 1 is
in Q4, and the point representing Group 2 is in Q1 in the configuration.
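
The quadrant assignments follow directly from the signs of the means in Table 2,
with the outward tendency on the abscissa and the inward tendency on the ordinate.
A small sketch (the function name is ours, and points lying exactly on an axis are
assigned by an arbitrary convention):

```python
def quadrant(outward, inward):
    """Quadrant of the point (outward, inward); axis points are assigned arbitrarily."""
    if outward >= 0 and inward >= 0:
        return "Q1"
    if outward < 0 and inward >= 0:
        return "Q2"
    if outward < 0 and inward < 0:
        return "Q3"
    return "Q4"


# Mean (outward, inward) tendencies from Table 2, for Dimensions 1, 2, and 3.
table_2 = {
    "Group 1": [(0.477, 0.493), (-0.158, -0.120), (0.026, -0.013)],
    "Group 2": [(0.353, 0.333), (0.148, 0.130), (0.122, 0.136)],
}
quadrants = {group: [quadrant(out, inw) for out, inw in points]
             for group, points in table_2.items()}
```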


5 Discussion of the Obtained Findings 


A joint configuration of brands and groups shows the relationships among brands,
among groups, and between brands and groups. Figure 1 shows the joint
configuration of brands and the two groups along Dimension 1. Group 1 is closer to
brands A and B than Group 2 is, and Group 2 is closer to brands other than A
and B than Group 1 is. The brands from which Group 1 switched to the other brands
have a larger outward tendency than the brands from which Group 2 switched to the
other brands. This is also true for the inward tendency. These results suggest that
the brand switching of Group 1 includes a larger proportion of brands A and B (which
have larger outward and inward tendencies and are nearer to Group 1 than to Group
2 in the configuration) than that of Group 2, while the brand switching
of Group 2 includes a larger proportion of brands other than A and B (which have
smaller outward and inward tendencies and are nearer to Group 2 than to Group 1
in the configuration) than that of Group 1. This is validated by the figures in
Table 3.

Table 3 shows the number of brand switchings from/to brand A or B and other
brands for each of Groups 1 and 2. For the brand switching of Group 1, 171
(171/270 = 0.63) of 270 were done from brand A or B, and 99 (99/270 = 0.37) were
done from brands other than A and B. For the brand switching of Group 2, 253
(253/612 = 0.41) of 612 were done from brand A or B, and 359 (0.59) were done
from brands other than A and B. For the brand switching of Group 1, 180 (0.67)
brand switchings were done to brand A or B, and 90 (0.33) were done to brands
other than A and B. For the brand switching of Group 2, 225 (0.37) brand switchings
were done to brand A or B, and 387 (0.63) brand switchings were done to brands
other than A and B. These figures show that the brand switching of Group 1 relates
more closely to brands A and B, and less closely to brands other than A and
B, and that the brand switching of Group 2 relates less closely to brands A and B,
and more closely to brands other than A and B.

Figure 2 shows the joint configuration of brands and groups along Dimension 2.
Only brand A is in Q3. The other brands are in Q1. Group 1 is in Q3. This suggests


Table 3 Brand switching from/to brand A or B and other brands of Groups 1 and 2 of
consumers

Brand switching            Group 1 (270)    Group 2 (612)
From A or B                171 (0.63)       253 (0.41)
From other than A or B      99 (0.37)       359 (0.59)
To A or B                  180 (0.67)       225 (0.37)
To other than A or B        90 (0.33)       387 (0.63)

Table 4 Brand switching from/to brand A in the third quadrant (Q3) and brands in
the first quadrant (Q1) of Groups 1 and 2 of consumers

Brand switching            Group 1 (270)    Group 2 (612)
From A                     119 (0.44)        77 (0.13)
From brands in Q1          151 (0.56)       535 (0.87)
To A                       115 (0.43)        73 (0.12)
To brands in Q1            155 (0.57)       539 (0.88)


that the brand switching of Group 1 includes brand A more than that of Group 2
does. Group 2 is in Q1, suggesting that the brand switching of Group 2 includes
brands in Q1 more than that of Group 1 does.

Table 4 shows the number of brand switchings from/to brand A (in Q3) and
brands in Q1 for each of Groups 1 and 2. The brand switching from brand A is
119 (0.44) for Group 1, while the brand switching from brand A is 77 (0.13) for
Group 2. The brand switching to brand A is 115 (0.43) for Group 1, while the brand
switching to brand A is 73 (0.12) for Group 2. The brand switching from brands in
Q1 is 151 (0.56) for Group 1, while the brand switching from brands in Q1 is 535
(0.87) for Group 2. The brand switching to brands in Q1 is 155 (0.57) for Group 1,
while the brand switching to brands in Q1 is 539 (0.88) for Group 2. These figures
support the suggestion mentioned above.
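
The proportions quoted in Tables 3 and 4 can be re-derived from the counts and the
group sizes (270 and 612). A short arithmetic check, with the counts copied from the
tables and variable names of our own choosing:

```python
GROUP_SIZES = (270, 612)  # Group 1, Group 2

# Counts from Table 3 and Table 4: (Group 1, Group 2) for each kind of switching.
TABLE_3 = {
    "From A or B": (171, 253),
    "From other than A or B": (99, 359),
    "To A or B": (180, 225),
    "To other than A or B": (90, 387),
}
TABLE_4 = {
    "From A": (119, 77),
    "From brands in Q1": (151, 535),
    "To A": (115, 73),
    "To brands in Q1": (155, 539),
}


def proportions(table):
    """Share of each group's brand switchings falling in each row, to two decimals."""
    return {row: tuple(round(count / size, 2)
                       for count, size in zip(counts, GROUP_SIZES))
            for row, counts in table.items()}


shares_3 = proportions(TABLE_3)
shares_4 = proportions(TABLE_4)
```

In both tables the "from" rows and the "to" rows each add up to the group sizes, so
the two proportions in each pair sum to one.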

Figure 3 shows the joint configuration of brands and groups along Dimension 3.
Group 1 is in Q4, and is close to the origin. Brands A and B are in Q3, and the other
brands are in Q1. In the brand switching, brands in Q4 are dominated by brands in
Q3, and are dominant over brands in Q1 [4]. This suggests that, for Group 1, the
brand switching from brands in Q3 and to brands in Q1 is larger than the other
way round. Group 2 is in Q1, suggesting that the brand switching of Group 2 is
closely related to brands in Q1. The number of brand switchings from/to Q1 and Q3
is already shown in Table 3, because brands A and B are in Q3, and the other brands
are in Q1.

The brand switching of Group 1 from brands in Q3 (brand A or B) is 171 (0.63),
and that from brands in Q1 (brands other than A and B) is 99 (0.37); this supports
the suggestion mentioned above. However, the brand switching to brands in Q3 is 180
(0.67), and that to brands in Q1 is 90 (0.33); this does not support the suggestion.
For Group 2, the brand switching from brands in Q1 is 359 (0.59), and that to brands
in Q1 is 387 (0.63). This shows that the brand switching of Group 2 is closely related
to brands in Q1.

The asymmetric multidimensional scaling based on the singular value decomposition
was extended so that a joint configuration of brands and consumers (groups
of consumers) is represented in a planar configuration. The joint configuration can
represent relationships between consumers (groups of consumers) and brands in the
brand switching. The extended asymmetric multidimensional scaling was applied
successfully to the brand switching among potato snack brands.

Each dimension represents a different aspect of the relationship between groups
of consumers and brands in the brand switching. Dimension 1 shows that the
brand switching of Group 1 is closely related to brands A and B, and that the brand
switching of Group 2 is closely related to brands other than A and B. This seems
reasonable, because brands A and B have the two largest market shares among the
nine brands (26.4% and 20.5%; see [6], Table 1), and Group 1 consists of consumers
whose spending on potato snacks is larger than or equal to the
average. Dimension 2 represents another aspect of the relationship of the two groups
with brands A and B. It is disclosed that the brand switching of Group 1 is
mainly related to brand A, and that the brand switching of Group 2 is substantially
related to brand B, while the brand switching between A and B is not large [4].
Dimension 3 shows that the substantial brand switching of Group 1 was done
from brand A or B to the other brands, and that the brand switching of Group 2 is
mainly related to brands other than A and B. These findings were obtained from the
joint configuration of brands and groups of consumers introduced by the adopted
methodology of asymmetric multidimensional scaling. In the present study, groups
were formed on the basis of the amount of money spent. Studies using groups
based on characteristics or buying behaviors of consumers would be worth
considering further, in order to investigate the relationships between consumers and
brands in the brand switching.

Evaluating the Quality of Data
Imputation in Cardiovascular Risk
Studies Through the Dissimilarity Profile
Analysis


Nadia Solaro 


Abstract Missing data handling is one of the crucial problems in statistical
analyses, and is almost always addressed through imputation. Although the literature
is rich in different imputation approaches, the problem of the assessment of the
quality of imputation, i.e., appraising whether the imputed values or categories are
plausible for variables and units, seems to have received less attention. This issue is
critical in every field of application, such as the medical context considered here, i.e.,
the assessment of cardiovascular disease risks. We faced the problem of comparing
the results obtained with different imputation methods and assessing the quality of
imputation through the dissimilarity profile analysis (DPA), which is a multivariate
exploratory method for the analysis of dissimilarity matrices. We also combined
DPA with the traditional profile analysis for data matrices in order to improve
understanding of the differentiation components among imputation methods.


Keywords Euclidean distance · Level · Missing data · Scatter · Shape


1 Introduction 


Missing data handling is one of the crucial problems in statistical analyses,
especially in the presence of multidimensional data. Almost always, handling of missing
data is accomplished through imputation. A datum not available for any reason is
replaced, by a suitable imputation method, with a value, if variables are quantitative,
or an attribute/modality, if variables are categorical. The statistical literature is
rich in a multitude of different imputation approaches (e.g., non-parametric vs
parametric imputation methods, single vs multiple imputation methods) [5, 7].
Less effort seems, however, to have been devoted to the problem of how to assess
the quality of imputation (QoI), i.e., how to establish whether imputed data are
consistent with the main features of variables and/or objects (or statistical units). The
concept of QoI is undoubtedly related to the field of application. Nonetheless, two
intertwined issues can be shared by many contexts: (1) the imputation plausibility
for variables, i.e., detecting whether unrealistic values or categories have been
imputed to incomplete variables, e.g., a negative value imputed to a variable that
can assume only positive values (QoI for variables); (2) the imputation plausibility
for objects, i.e., assessing whether a datum imputed for an object is consistent
with its profile as given by the values, or categories, or both, it has on the other
variables (QoI for objects). We argue that these issues are critical in every field
of application, all the more so in the specific medical context we considered, i.e.,
the assessment of cardiovascular disease (CVD) risks using information from the
Autonomic Nervous System (ANS). In particular, we dealt with an overall dataset
comprising variables collected from 88 different clinical studies undertaken over
the period 1999-2014 and pertaining to specific groups of subjects (i.e., athletes,
healthy individuals, smokers, stressed, obese and hypertensive subjects) [9, 12]. A
missing data problem typically arises in contexts like this because of the adopted
research protocols, which lay down rules for each clinical study. Depending on
the protocol, measurements of some variables might not be contemplated within a
specific study, especially if such information is too expensive or difficult to measure.
In this sense, missing data appearing in the overall dataset can be regarded as
generated by a MAR mechanism [12].

After having performed imputation on the overall dataset with different
approaches, we faced the problem of comparing the results thus obtained by taking
into account the consistency of the clinical group profiles against their expected
traits. We then favoured the perspective of QoI for objects, instead of variables,
although these two aspects are strictly related. Comparisons among clinical
group profiles imputed with the different methods were carried out by means of
a multivariate exploratory tool for the analysis of dissimilarity matrices, i.e., the
dissimilarity profile analysis (DPA) [8, 10], which here was also combined with the
traditional profile analysis (PA) for multivariate data matrices [3] in order to deepen
understanding of the differentiation components among profiles.


2 The DPA Method in Short 


DPA is an exploratory multivariate statistical method designed for the analysis of
dissimilarity matrices [8, 10]. It aims at investigating profiles of differences within
the same set of objects, to assess whether two objects differ from the others in a
similar way or not, and then to detect the main components that explain such
differences. Given a set of $n$ objects, let $\Delta$ be an $(n \times n)$ square and
symmetric matrix containing the dissimilarities $\delta_{ir} = \delta_{ri}$,
$i \neq r = 1, \ldots, n$, for which the usual properties hold [3]. In particular,
matrix $\Delta$ has a zero diagonal for the self-dissimilarities ($\delta_{ii} = 0$ for
all $i$). The elemental data of DPA are the dissimilarity profiles (DPs), which are
given by the $n$ row-vectors
$\boldsymbol{\delta}_i = [\delta_{i1}, \delta_{i2}, \ldots, 0, \ldots, \delta_{i,n-1}, \delta_{in}]$
of matrix $\Delta$. Since $\boldsymbol{\delta}_i$ contains the estimates, according to
any dissimilarity measure, of the degree of diversity of object $i$ with respect to
each of the other $n - 1$ objects, the $i$-th DP expresses the pattern of difference
of object $i$, and can then be analysed to detect the underlying features that
distinguish this DP in comparison with the other DPs. This kind of analysis relies on
the decomposition of the squared Euclidean (EU) distance $d^2_{ir}$ between DPs [8, 10]:

$$d^2_{ir} = \sum_{l=1}^{n} (\delta_{il} - \delta_{rl})^2 = d^2_{(ir)} + 2\delta^2_{ir}, \qquad \forall\, i \neq r = 1, \ldots, n,$$

where $d^2_{(ir)} = \sum_{l=1,\, l \neq i, r}^{n} (\delta_{il} - \delta_{rl})^2$ is the
leave-$ir$-out squared EU distance, i.e., the squared EU distance between the $i$-th
and $r$-th DPs computed excluding their direct comparison, which is given by the term
$2\delta^2_{ir}$. Similarly to PA [3], distance $d^2_{(ir)}$ can be decomposed into three
additive terms, called the DP level, DP scatter and DP shape components, respectively,
according to the DP decomposition formula:

$$d^2_{(ir)} = (n-2)\,(\bar{\delta}_{i(r)\cdot} - \bar{\delta}_{r(i)\cdot})^2 + (v_{i(r)\cdot} - v_{r(i)\cdot})^2 + 2\, v_{i(r)\cdot}\, v_{r(i)\cdot}\, (1 - q_{(ir)}), \qquad (1)$$

for each $i \neq r$ [10]. The quantities appearing in (1) are:

• in the DP level component (first term in (1)), the $i$-th DP leave-$r$-out level:
  $\bar{\delta}_{i(r)\cdot} = \frac{1}{n-2} \sum_{l=1,\, l \neq i, r}^{n} \delta_{il}$,
  which is the level of the $i$-th DP excluding itself and its reciprocal dissimilarity
  with the $r$-th DP. Clearly, it holds that $\bar{\delta}_{i(r)\cdot} \geq 0$, for each
  $r \neq i = 1, \ldots, n$;

• in the DP scatter component (second term in (1)), the $i$-th DP leave-$r$-out scatter:
  $v_{i(r)\cdot} = \sqrt{\sum_{l=1,\, l \neq i, r}^{n} (\delta_{il} - \bar{\delta}_{i(r)\cdot})^2}$,
  which gives the spread of the $i$-th DP around its leave-$r$-out level. It is a kind
  of deviance of the $i$-th DP around its level that is computed excluding itself and
  the $r$-th DP;

• in the DP shape component (third term in (1)), the leave-$ir$-out shape of the pair
  $(i, r)$ of DPs: $q_{(ir)} = \frac{v_{(ir)}}{v_{i(r)\cdot}\, v_{r(i)\cdot}}$, where
  $-1 \leq q_{(ir)} \leq +1$, and
  $v_{(ir)} = \sum_{l=1,\, l \neq i, r}^{n} (\delta_{il} - \bar{\delta}_{i(r)\cdot})(\delta_{rl} - \bar{\delta}_{r(i)\cdot})$.
  Quantity $q_{(ir)}$ is a sort of correlation coefficient between the $i$-th
  and $r$-th DPs excluding their reciprocal dissimilarity. It measures the degree of
  similarity ($q_{(ir)}$ close to 1) or diversity ($q_{(ir)}$ close to $-1$) in the way
  that the $i$-th and $r$-th DPs differ from the other DPs.
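
A numerical check can make the DP decomposition concrete. The sketch below is an
illustration only: the 5 x 5 dissimilarity matrix is hypothetical, the function and
variable names are ours, and the shape term is computed as 2(v_i v_r - v_(ir)),
which equals 2 v_i v_r (1 - q_(ir)) whenever both scatters are positive.

```python
import math


def dp_components(delta, i, r):
    """Level, scatter and shape components of the leave-ir-out squared distance."""
    n = len(delta)
    others = [l for l in range(n) if l not in (i, r)]
    dp_i = [delta[i][l] for l in others]
    dp_r = [delta[r][l] for l in others]

    level_i = sum(dp_i) / (n - 2)                       # leave-r-out level of DP i
    level_r = sum(dp_r) / (n - 2)                       # leave-i-out level of DP r
    v_i = math.sqrt(sum((x - level_i) ** 2 for x in dp_i))
    v_r = math.sqrt(sum((y - level_r) ** 2 for y in dp_r))
    v_ir = sum((x - level_i) * (y - level_r) for x, y in zip(dp_i, dp_r))
    q = v_ir / (v_i * v_r) if v_i * v_r > 0 else float("nan")

    return {
        "level": (n - 2) * (level_i - level_r) ** 2,
        "scatter": (v_i - v_r) ** 2,
        "shape": 2 * (v_i * v_r - v_ir),
        "q": q,
        "d2_leave_ir_out": sum((x - y) ** 2 for x, y in zip(dp_i, dp_r)),
    }


# Hypothetical symmetric dissimilarity matrix with zero diagonal.
delta = [
    [0, 2, 5, 4, 7],
    [2, 0, 3, 6, 1],
    [5, 3, 0, 2, 8],
    [4, 6, 2, 0, 3],
    [7, 1, 8, 3, 0],
]
parts = dp_components(delta, 0, 1)
total = parts["level"] + parts["scatter"] + parts["shape"]
d2_full = sum((a - b) ** 2 for a, b in zip(delta[0], delta[1]))
```

Here `total` reproduces the leave-ir-out squared distance, and adding twice the
squared reciprocal dissimilarity recovers the full squared distance between the two
DPs.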


If an $(n \times p)$ data matrix $X$ is also known, the usual PA can be applied and then
combined with DPA. To apply standard results of PA, dissimilarities $\delta_{ir}$ in
matrix $\Delta$ have to be computed as EU distances of the $n$ row-vectors
$\mathbf{x}_i = [x_{ij}]_{j=1,\ldots,p}$ [3]. Then, the squared EU distance
$\delta^2_{ir}$ can be decomposed, in its turn, into three additive terms representing
the P level, P scatter and P shape components, respectively:

$$\delta^2_{ir} = p\,(\bar{x}_{i\cdot} - \bar{x}_{r\cdot})^2 + (v_i - v_r)^2 + 2\, v_i v_r (1 - q_{ir}), \qquad \forall\, i \neq r = 1, \ldots, n, \qquad (2)$$

where $\bar{x}_{i\cdot} = \frac{1}{p} \sum_{j=1}^{p} x_{ij}$ is the level of the $i$-th
profile; $v_i = \sqrt{\sum_{j=1}^{p} (x_{ij} - \bar{x}_{i\cdot})^2}$ is the scatter of
the $i$-th profile; $q_{ir} = \frac{v_{ir}}{v_i v_r}$ is the correlation coefficient of
the $i$-th and $r$-th profiles, with $-1 \leq q_{ir} \leq +1$, and
$v_{ir} = \sum_{j=1}^{p} (x_{ij} - \bar{x}_{i\cdot})(x_{rj} - \bar{x}_{r\cdot})$.
To distinguish (2) from the DP decomposition (1), we denote formula (2) as
the P decomposition. A critical requirement of PA is that variables have to be
comparable regarding at least the unit of measurement. Otherwise, they have to be
standardised [3].
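
The P decomposition (2) can be checked in the same spirit on two rows of a data
matrix. The values below are hypothetical and assumed to be already comparable in
their unit of measurement; the function name is ours.

```python
import math


def p_components(x_i, x_r):
    """P level, P scatter and P shape terms for two profiles of equal length p."""
    p = len(x_i)
    mean_i, mean_r = sum(x_i) / p, sum(x_r) / p
    v_i = math.sqrt(sum((a - mean_i) ** 2 for a in x_i))
    v_r = math.sqrt(sum((b - mean_r) ** 2 for b in x_r))
    v_ir = sum((a - mean_i) * (b - mean_r) for a, b in zip(x_i, x_r))
    level = p * (mean_i - mean_r) ** 2
    scatter = (v_i - v_r) ** 2
    shape = 2 * (v_i * v_r - v_ir)   # equals 2 v_i v_r (1 - q_ir) when v_i v_r > 0
    return level, scatter, shape


# Hypothetical profiles of two objects on p = 3 comparable variables.
x_1 = [1.0, 2.0, 4.0]
x_2 = [2.0, 0.0, 3.0]
level, scatter, shape = p_components(x_1, x_2)
squared_distance = sum((a - b) ** 2 for a, b in zip(x_1, x_2))
```

The three terms add up to the squared Euclidean distance between the two profiles,
so each term can be read as the share of the distance due to differences in level,
in scatter, and in shape.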


3 DPA for Evaluating QoI in a CVD Risk Case Study


Analysis of QoI is considered within the case study related to the role of ANS
proxies in the CVD risk assessment, as mentioned in Sect. 1 and described in
[12] and [9]. The overall dataset contains a total number of n = 1314 subjects
and includes the variables briefly described in Table 1, which refer to personal
data, ANS proxies, baroreflex gain index, blood pressure measures and body mass
index. ANS proxies are the measures of the heart rate variability (or RR variability)
resulting from the spectral analysis of the electrocardiogram traces, while the
baroreflex gain index concerns the mechanism that helps blood pressure remain

Table 1 Definition of the variables considered in the case study 


Sets of variables Description 
e Personal data Age and gender (0 = Female, 1 = Male) 
e Set A ANS proxies: 
(pa = 7 common HR (heart rate, in beat/min) 
complete variables) RRMean (average of RR interval from tachogram, in msec) 


RRIP (total power, or RR variance, in ms?) 

RRLFnu & RRHFnu (normalised power of low and high frequency 
components, resp., of RR variance, in nu) 

RRLFHF (ratio between absolute values of LF and HF) 


Anthropometrics: 
BMI (Body mass index: weight in kilos/height in m”) 
e Set B ANS proxies: 
(pp = 8 incomplete RRLFHz & RRHFHz (centre frequency of LF and HF, resp., in Hz) 
variables) ARRLFnu (difference in LF power between stand and rest, in nu) 


Baroreflex gain index: 
AlphaM (frequency domain measure of baroreflex gain, 
in ms/mmHg) 
Blood pressure measures: 
SAP & DAP (systolic and diastolic arterial pressure, resp., 
by sphygmomanometer, in mmHg) 
SAPMean (SAP average of systogram, in mmHg) 
SAPLFa (absolute power of LF component of systogram, in 
mmHg”) 
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stable [6]. Apart from age and gender, wholly collected, the other p = 15 variables 
were divided into: (1) Set A, with the p4 = 7 common complete variables listed in 
Table 1, and (2) Set B, with the pg = 8 incomplete variables reported in Table 1. 
On the other hand, nc = 836 subjects have all the information (complete subjects), 
while the other ny = 478 subjects have missing values in set B (incomplete 
subjects). 

Imputation was carried out on the overall dataset through a variety of
approaches [12]: (1) Non-parametric single imputation methods: IPCA (iterative
principal component analysis) and FAMD (factorial analysis for mixed data) [4],
FIP, FIM and WG.FIP (forward imputation) [11]; (2) Parametric multiple imputation
methods: EM algorithm [2], MIPCA (multiple imputation with the PCA) [4];
(3) Data fusion non-parametric methods: hot-deck distance-based methods, such
as distance hot-deck (NND) and random hot-deck techniques (RndNND) [1]. These
various imputation methods gave rise to different imputed ANS profiles of
the clinical groups. The crucial point was therefore to choose the imputation
method that best met the expected within-group ANS profiles according to the prior
knowledge we had of the group features. To this end, we applied DPA and PA to
compare the various imputed ANS profiles obtained for the same group and assess
QoI according to the strategy of analysis described in Sect. 3.1.


3.1 Application of DPA and PA for Evaluating QoI


Before applying DPA and PA, we had to take into account that the groups were not
directly comparable by age and gender. Inspections of the imputed ANS profiles
were then necessarily based on variables adjusted for age and gender effects [9,
12]. The main traits of the groups were accordingly summed up by the adjusted
median (AdjMed) profiles [12], which are of three types: (1) AdjMed profiles of
the complete part (“benchmark profiles”), given by the within-group medians of
the variables in sets A and B adjusted and standardised within the complete part
of the data (i.e., the n_C = 836 subjects); (2) AdjMed profiles of the incomplete
part, computed on the set of the incomplete subjects as within-group medians of the
variables in set A adjusted and standardised within the entire set of the n = 1314
subjects; (3) AdjMed imputation (AdjMedImp) profiles, computed on the set of the
incomplete subjects as within-group medians of the variables in set B adjusted and
standardised within the whole set of the n = 1314 subjects.

We used these latter AdjMedImp profiles as input data of both PA and DPA,
which we carried out within the groups separately considered. In particular,
regarding PA, the input data matrix $X_g$ of clinical group g contains, in its rows, the
$n_g = 9$ AdjMedImp profiles, each referred to a specific imputation method, while, in
its columns, the $p_B = 8$ variables of set B (Table 1). Values in $X_g$ are thus given by
the within-group-g medians of the eight adjusted and standardised variables imputed
by the nine methods. Regarding DPA, its input data are given by the DPs $\delta_i$ of the
AdjMedImp profiles within group g, i.e., the rows of matrix $\Delta$, whose elements are
the EU distances $\delta_{ir}$ between the rows of matrix $X_g$, $\forall\, i \neq r = 1, \ldots, 9$.

For each clinical group, we computed the DP and P decompositions described
in Sect. 2, formulas (1) and (2), and displayed the percentage results in the so-called
complete DP decomposition plot. This plot is called complete because it
also includes the P decomposition (2). Besides, we referred to two further graphs to
better capture between-profile differences. The first graph, related to PA, is the profile
plot, which depicts the above three types of AdjMed profiles against the variables in
sets A and B. The second graph is the DP plot, which displays the normalised DPs
of the imputation methods, i.e., the normalised EU distances of the AdjMedImp
profiles: $\tilde{\delta}_{ir} = \delta_{ir} / \max_{i,r} \delta_{ir}$, $\forall\, i \neq r$, since they help better detect the pairs of methods
that are much more different/similar to the others.
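
The construction of the normalised DPs from a matrix of profiles can be sketched as follows in Python. This is a minimal illustration, not the code used in the study: it assumes that normalisation means dividing every EU distance by the largest one, and the function names and the three toy profiles are hypothetical.

```python
import numpy as np

def normalised_dps(X):
    """Rows are the normalised DPs: EU distances between rows of X, divided by the maximum."""
    X = np.asarray(X, dtype=float)
    delta = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return delta / delta.max()

def extreme_pairs(delta_norm):
    """Return the most similar and the most different pair of profiles (i < r)."""
    rows, cols = np.triu_indices(delta_norm.shape[0], k=1)   # pairs above the diagonal
    values = delta_norm[rows, cols]
    lo, hi = values.argmin(), values.argmax()
    most_similar = (int(rows[lo]), int(cols[lo]))
    most_different = (int(rows[hi]), int(cols[hi]))
    return most_similar, most_different

# Three toy profiles on two variables (in the case study: 9 methods x 8 variables).
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 4.0]])
dps = normalised_dps(X_toy)
print(dps)
print(extreme_pairs(dps))
```

Each row of `dps` is one trajectory of the DP plot; the zero on the diagonal is the self-dissimilarity at which the trajectory touches the horizontal axis.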

We then assessed QoI within each clinical group by examining the results of PA
and DPA jointly and bearing in mind our prior knowledge of the main traits of the 
incomplete subjects belonging to the various groups. 


3.2 Assessment of QoI in the Athlete Group


Given the richness of the obtained results, we are going to refer to the athlete
group only. The upper panel of Fig. 1 (see also [9]) contains the profile plot of
the three types of AdjMed profiles against the variables in set A (first seven) and
set B (last eight). Incomplete athletes (thin dashed line) have higher (median) values
of BMI, RRMean and RRHFnu, and lower (median) values of HR, RRLFnu and
RRLFHF than the complete athletes (benchmark profile, thin solid line). That is
consistent with what was expected because incomplete athletes were known to have
more powerful traits than the complete ones. Accordingly, regarding the imputed
variables (set B), higher values of ΔRRLFnu and AlphaM are expected along with
potentially lower values of RRHFHz and SAPMean [12]. The AdjMedImp profiles
of each method over set B (black or grey lines with different styles) have therefore
to be compared taking into account the trend drawn for the incomplete part. While
all the imputation methods appear to give consistent results for ΔRRLFnu, only the
methods IPCA, FAMD, WG.FIP, MIPCA and EMlogtr produce the expected higher
values for AlphaM. EMlogtr, however, produces too high levels of SAPMean, while
both EMlogtr and MIPCA impute too low values for RRLFHz and RRHFHz. IPCA,
FAMD and WG.FIP then appear to produce more plausible imputed ANS profiles.
Now, the crucial questions are: How similar/different are the results obtained
in median by the considered imputation methods? Regarding which components
do the AdjMedImp profiles mainly differ from one another? DPA was applied to
address these questions and detect “different patterns of difference” between the
methods. The lower panel of Fig. 1 reports the DP plot with the normalised DPs of
the imputation methods (Sect. 3.1). Given the presence of the self-dissimilarities,
each trajectory falls to zero in correspondence with the method to which it refers.
Fig. 1 Profile and dissimilarity profile analysis for the athlete group. Upper panel: Profile plot of
adjusted median profiles (sets A and B). Lower panel: DP plot of the nine imputation methods


More importantly, the DP plot admits a double reading, from both a pairwise and
a DP comparison perspective. In a pairwise comparison perspective, by fixing
a method on the horizontal axis, one can check how far the trajectories of the
other techniques are from zero, i.e., from the fixed method. The higher (lower) the
trajectories are, the more different (similar) the other methods are in comparison
with the one fixed on the horizontal axis. For instance, WG.FIP (bold dashed line)
and RndNND (grey two-dashed line) produce the most different median imputation
results (normalised dissimilarities $\tilde{\delta}_{47} = \tilde{\delta}_{74} = 1$), while IPCA (bold solid line) and FAMD (thin dotted line)
produce the most similar median imputation results ($\min_{i \neq r} \tilde{\delta}_{ir} = \tilde{\delta}_{12} = \tilde{\delta}_{21} = 0.106$). On
the other hand, in a DP comparison perspective, DPs have to be analysed as whole
trajectories. To compare a method DP with another one, we have to discard the part
of the graph concerning their self-dissimilarities and reciprocal dissimilarity $\delta_{ir}$. In
such a way, we can see how different two DP trajectories are over the other methods
and thus ascertain whether the two methods share a similar pattern of difference
compared to the others. For instance, the DP trajectories of WG.FIP and FAMD
appear to be almost parallel, suggesting a substantial difference in the DP level
component.
More thorough comparison analyses between trajectories are based on the DP
and P decompositions (Sect. 2) of the squared EU distances $d_{ir}^2$ of the imputation
method DPs. Figure 2 reports the complete DP decomposition plot concerning the
percentage results of DPA (first row of panels) combined with PA (second row).
Each of the components (DP level, DP scatter and DP shape, along with P level,
P scatter and P shape) is depicted in a square plot. In the upper part of each plot,
there are squares (DP decomposition) or circles (P decomposition) having areas
proportional to the corresponding percentage recorded in the lower part. Over the
six square plots, the percentages in the same position (i, r), with i > r, sum to
100%. In this way, for each pair of methods, we can see which components among
the six best explain the observed patterns of difference. As an example, we have
just noted that WG.FIP and RndNND have the most different median imputation
results. By Fig. 2, their reciprocal dissimilarity represents the crucial component of
difference, i.e., the P component (amounting in total to 74.84%), rather than their
DP component (amounting in total to 25.16%). In other words, they mostly differ
reciprocally in the way they impute values to variables in set B rather than in the
way they differ from the other methods. For a better understanding, Fig. 3 provides
two pairwise plots, the first for the P component, the second for the DP component.
Regarding the reciprocal dissimilarity (P component), the upper panel contains the
pairwise profile plot of WG.FIP and RndNND, which is obtained from the profile
plot in Fig. 1 (upper panel) using only the variables in set B. The two trajectories
differ almost exclusively in shape (74.34%, Fig. 2), i.e., WG.FIP and RndNND tend to
impute similar values in median, with the exceptions of AlphaM and SAPMean.
Regarding the DP component, the lower panel in Fig. 3 shows the pairwise DP plot
of WG.FIP and RndNND, which is taken from the DP plot (Fig. 1, lower panel) by
removing their self-dissimilarities and reciprocal dissimilarity. Also in this case, the
two trajectories differ almost exclusively in shape (22.14%, Fig. 2) according to an
opposite trend (DP shape coefficient $q_{(47)} = q_{(74)} = -0.67$). That means that WG.FIP and RndNND differ
from the other methods by a different pattern. Where in the pairwise DP plot the
WG.FIP trajectory is low (high), thus indicating its similar (different) performance
to some methods, the RndNND trajectory turns out to be high (low), thus denoting
its different (similar) performance to those same methods.

Finally, returning to IPCA, FAMD and WG.FIP, which were just recognised as
good candidates for imputation in the athlete group, by Fig. 2, they mostly tend
to differ from one another for the DP component rather than the P component.
In particular, the profile plot in Fig. 1 (upper panel) shows that they have similar
profiles, which mainly differ in the P shape component for the way in which they
impute values of AlphaM and SAPMean. Regarding the DP component, IPCA,
FAMD and WG.FIP tend to differ similarly from the other methods, i.e., they share
a similar pattern of difference because the DP level component weighs more than
DP scatter and DP shape. In other words, the imputations of IPCA, FAMD and WG.FIP
differ similarly from the other methods in the magnitude of the median of the
imputed values. The final choice among IPCA, FAMD and WG.FIP is also based on
the third quartile of the adjusted imputation (Adj-Q3-Imp) profiles. The upper panel
of Fig. 4 provides the profile plot of the Adj-Q3-Imp profiles of IPCA, FAMD and
WG.FIP, the lower panels the pairwise DP plots for the three pairwise comparisons.
Similar remarks to those advanced for the AdjMedImp profiles can be made, with
the sole exception that here IPCA and FAMD have the DP shape component
higher than the DP level component (results omitted). As a final choice, WG.FIP
is preferred to IPCA and FAMD because it guarantees higher imputed values of
AlphaM and smaller imputed values of SAPMean also for the first 75% of the
athletes.

Fig. 3 Athlete group: Pairwise plots of WG.FIP and RndNND. Upper panel: Pairwise profile plot
(P level: 0.14%, P scatter: 0.37%, P shape: 74.34%, Fig. 2). Lower panel: Pairwise DP plot (DP
level: 2.98%, DP scatter: 0.04%, DP shape: 22.14%, Fig. 2)

Fig. 4 Athlete group: Comparison among IPCA, FAMD and WG.FIP using the third quartile of
the adjusted imputation profiles. Upper panel: Profile plot. Lower panels: Pairwise DP plots


4 Conclusions 


Evaluation of quality of imputation (QoI) performed in the considered CVD risk
assessment context was carried out by using two multivariate exploratory statistical
methodologies, i.e., DPA and PA, in an integrated manner, and exploiting the
available prior knowledge of the main features of the subjects having missing
information within the clinical groups. From a practical point of view, one of the
central facts that emerged from the study was that the same imputation method might
prove to be satisfactory for a clinical group, or a subset of subjects within it, but
not for the other groups. For instance, the incomplete healthy subjects involved in
specific clinical studies were known to be close to a hypertensive state. Compared
with the set of the complete healthy subjects, lower levels of AlphaM along with
higher values of blood pressure measures were then expected. The imputation
method that best reflected such a trend proved to be the distance hot-deck NND
rather than WG.FIP as for the athletes. In general, a more suitable strategy to obtain
realistic ANS imputed profiles could be that of switching from one imputation
method to another depending on the features of the incomplete subjects, instead of
applying a single imputation method to the whole dataset. Such kinds of inspections
would, however, require the availability of an integrated set of interactive diagnostic
tools, capable of combining PA and DPA with the prior knowledge, when at hand,
about missing features. The DPA method is at an early stage of development, so
that many aspects are still work in progress, e.g., the mathematical handling of DPs
by interpolation or smoothing techniques. However, unlike the DP decomposition,
the form of the DP trajectories is not invariant to the order in which the objects are
taken. A primary challenge then is to set up procedures by which the objects can be
arranged in a non-arbitrary order, or the horizontal axis of a DP plot can get metric
properties.
Part III
Statistical Modeling


Measuring Economic Vulnerability: A Structural Equation Modeling Approach


Ambra Altimari, Simona Balzano, and Gennaro Zezza 


Abstract Macroeconomic vulnerability is currently measured by the United 
Nations through a weighted average of eight variables related to exposure to shocks, 
and frequency of shocks, known as Economic Vulnerability Index (EVI). In this 
paper we propose to extend this measure by taking into account additional variables 
related to resilience, i.e., the ability of a country to recover after a shock. Since 
vulnerability can be considered as a latent variable, we explore the possibility of 
using the Structural Equation Model approach as an alternative to an index based 
on arbitrary weights. Using data from a panel of 98 countries over 19 years, we test 
our results with respect to the ability of the indices based on weighted averages, or 
on the SEM, in explaining the growth rate in real GDP per capita. 


Keywords Hierarchical component model · Partial least squares · Structural
equation modeling · Vulnerability index


1 Introduction 


In the ongoing discussion on how to measure well-being and poverty, especially in 
relation to the allocation of international aid, the concept of economic vulnerability 
has emerged as potentially more useful than measures of poverty. 

Several measures of vulnerability are used in the literature: they are mainly
defined as composite indicators, typically computed as weighted averages of a set of
indicators, where all indicators are assumed to have arbitrary (mostly equal) weights
and to be uncorrelated with each other (i.e., correlation among them is ignored).

In order to identify countries that are eligible to enter or leave the Least
Developed Countries category, the United Nations refers, among other measures,
to the Economic Vulnerability Index (EVI, [4]).

The EVI is computed as the average of two sub-indices: an Exposure Index and a
Shock Index, which are weighted averages of five and three variables, respectively.
As such, the EVI focuses on risk, but neglects measures of resilience, i.e., the ability
of a country to recover after a shock.
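
The two-level averaging that defines such an index can be sketched in a few lines of Python. The sketch is purely illustrative: the weights within each sub-index are not reported here, so equal weights are assumed by default, and the function name and the indicator values are hypothetical.

```python
import numpy as np

def composite_index(exposure, shock, w_exposure=None, w_shock=None):
    """Average of two sub-indices, each a weighted average of its indicators.

    exposure, shock : indicator values on a 0-100 scale (last axis = indicators)
    w_exposure, w_shock : weights summing to one; equal weights if omitted
                          (an assumption made for illustration only)
    """
    exposure = np.asarray(exposure, dtype=float)
    shock = np.asarray(shock, dtype=float)
    if w_exposure is None:
        w_exposure = np.full(exposure.shape[-1], 1.0 / exposure.shape[-1])
    if w_shock is None:
        w_shock = np.full(shock.shape[-1], 1.0 / shock.shape[-1])
    exposure_index = exposure @ np.asarray(w_exposure, dtype=float)
    shock_index = shock @ np.asarray(w_shock, dtype=float)
    return 0.5 * (exposure_index + shock_index)

# Hypothetical country: five exposure indicators and three shock indicators.
print(composite_index([20, 40, 60, 80, 100], [10, 20, 30]))
```

With the hypothetical values above, the exposure sub-index is 60, the shock sub-index is 20 and the composite index is their average, 40.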

We start our analysis from all EVI variables, observed on 98 developing countries
between 1990 and 2013.¹

Our contribution moves along two dimensions: 


• we propose to extend the EVI model, including additional variables² affecting
Resilience. We refer to this new specification as Extended-EVI, and compute its
value as a weighted average of its determinants;

• we use the Structural Equation Model (SEM) approach to estimate the vulnerability
index, both in its original specification and in our extended version. In
this way, a weighting system deduced from the data replaces the fixed arbitrary
weights for each year, and the correlation among (blocks of) variables plays its
role in determining the vulnerability score. For this purpose we use the Partial
Least Squares approach to Structural Equation Model (PLS-SEM) [5, 9, 11],
whose aim is mainly predictive, to estimate the vulnerability index.


2 The Economic Vulnerability Index (EVI): A Possible 
Extension 


The list of the base EVI variables is given in Table 1, the first five indicators 
measuring the exposure to risks and the last three referring to the outcomes of a 
shock. 

All variables are expressed on a 0-100 scale and are measured so that a higher 
value implies higher vulnerability, i.e., so that they enter the index with positive 
weights. 

The Extended-EVI includes nine additional indicators (listed in Table 2) affecting
resilience. These new variables account for the economic strength (1-3) and the
strength of institutions (4-9).

In the new specification we also include four additional variables for exposure
to risk (see Table 2).³ Additional variables have been selected for their theoretical
relevance for vulnerability, subject to the availability of the data over time for a large
enough number of countries.


¹ EVI data can be downloaded from http://byind.ferdi.fr/en/evi.

² Data sources: UN databases: UNSD-NA, UN-PD, UNCTAD Stat, FAOSTAT; World Bank
databases: WDI, WGI; Centre for International Earth Science Information Network (CIESIN);
Emergency Events Database (EM-DAT); Centre d’Etudes Prospectives et d’Informations Internationales
(CEPII).

³ For a detailed description of the variables, see the companion page at http://gennaro.zezza.it/files/abz.
Table 1 EVI variables [4]

1. Exposure | 1.1 Population (smallness)
 | 1.2 Remoteness from world markets
 | 1.3 Export concentration
 | 1.4 Share of agriculture, forestry, and fisheries in GDP
 | 1.5 Share of population living in low elevated coastal zone
2. Shock | 2.1 Victims of natural disasters
 | 2.2 Instability of agricultural production
 | 2.3 Exports instability


Table 2 The additional variables in the Extended-EVI

3. Resilience | 3.1 Net flows on external public and publicly guaranteed debt (%GDP)
 | 3.2 Debt service on external debt PPG (%GDP)
 | 3.3 Gross fixed capital formation (%GDP)
 | 3.4 Control of corruption
 | 3.5 Government effectiveness
 | 3.6 Political stability and absence of violence/terrorism
 | 3.7 Regulatory quality
 | 3.8 Rule of law
 | 3.9 Voice and accountability
1. Exposure | 1.6 Surface area
 | 1.7 Import concentration
 | 1.8 Foreign direct investment net inflows (%GDP)
 | 1.9 Net official development assistance and official aid (%GDP)


3 The PLS Approach to Structural Equation Model 


The PLS-Path Modeling (PLS-PM) [3, 9, 11], i.e., the component-based approach
to SEM, is one of the most used estimation methods for latent variable models,
such that the name PLS-SEM is often used as an alternative in the most recent
literature [5]. It represents the main alternative to LISREL [6], differing, among
other things, in the main aim being pursued, i.e., predictive (PLS-PM) versus
confirmatory (LISREL), which makes it preferable for our aims.

Given a data matrix X, partitioned by column into J blocks, a path diagram (Fig. 1)
is the typical representation of a causal model where each block $X_j$ (j = 1, ..., J)
is a set of manifest variables and is conceptually connected to a latent variable $\xi_j$.

In such a diagram, rectangles represent manifest variables (MV), ellipses latent
variables (LV), and arrows the relations between them, which are supposed to be
linear. In Fig. 1 two models are shown, i.e., the base EVI model (in white) and the
enlarged model including (in gray) the resilience and the additional variables for
exposure.
Fig. 1 The EVI path model


The PLS-PM algorithm computes latent variable scores, path coefficients (or
inner weights), and outer weights. It is based on alternating, until convergence, an
external and an internal estimate of the LV, based on sets of regressions.

For details about the algorithms see [9] and [3].
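
The alternation between external and internal estimates can be illustrated with a minimal Python sketch. It is not the procedure used in this paper: it assumes Mode A outer weights and the centroid inner scheme, handles only a plain set of connected blocks (no super-block, no consistent PLS correction, no path coefficients), and the function names and simulated data are hypothetical.

```python
import numpy as np

def standardize(a):
    """Centre and scale a vector, or the columns of a matrix, to unit variance."""
    a = np.asarray(a, dtype=float)
    return (a - a.mean(axis=0)) / a.std(axis=0)

def pls_pm(blocks, links, tol=1e-10, max_iter=500):
    """Minimal PLS-PM iteration (Mode A outer weights, centroid inner scheme).

    blocks : list of (n x p_j) arrays of manifest variables
    links  : symmetric (J x J) 0/1 array; links[j][k] = 1 if LV j and LV k
             are connected in the inner model (zero diagonal)
    """
    blocks = [standardize(X) for X in blocks]
    links = np.asarray(links, dtype=float)
    n = blocks[0].shape[0]
    weights = [np.ones(X.shape[1]) / np.sqrt(X.shape[1]) for X in blocks]
    for _ in range(max_iter):
        # External estimate: standardised weighted sum of each block.
        Y = np.column_stack([standardize(X @ w) for X, w in zip(blocks, weights)])
        # Internal estimate: signed sum of the scores of the connected LVs.
        E = np.sign(np.corrcoef(Y, rowvar=False)) * links
        Z = Y @ E
        # Mode A update: covariances between manifest variables and internal estimate.
        new = [X.T @ Z[:, j] / n for j, X in enumerate(blocks)]
        new = [w / np.linalg.norm(w) for w in new]
        change = max(np.abs(w1 - w0).max() for w1, w0 in zip(new, weights))
        weights = new
        if change < tol:
            break
    scores = np.column_stack([standardize(X @ w) for X, w in zip(blocks, weights)])
    return scores, weights

# Simulated example: two correlated latent variables with 3 and 2 indicators.
rng = np.random.default_rng(1)
n_obs = 200
lv1 = rng.normal(size=n_obs)
lv2 = 0.7 * lv1 + rng.normal(scale=0.5, size=n_obs)
X1 = lv1[:, None] + rng.normal(scale=0.5, size=(n_obs, 3))
X2 = lv2[:, None] + rng.normal(scale=0.5, size=(n_obs, 2))
scores, weights = pls_pm([X1, X2], links=[[0, 1], [1, 0]])
print(np.corrcoef(scores, rowvar=False)[0, 1])
```

Once the scores are available, path coefficients could be obtained by regressing each endogenous score on the scores of its predecessors.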

We specify both the models for estimating the two versions of the EVI according
to the Hierarchical Component Model, also referred to as Repeated Indicators
Approach [7, 8, 10, 11], suitable to model structures with nested constructs. It is
typically used for modeling composite indicators, when a set of sub-indices (first-order
constructs) compose the global index (second-order construct) [5]. In such
a model, all indicators in each of the J blocks are put together and used to define a
(J + 1)-th one, the so-called super-block, i.e., a higher-order latent variable whose
final score is interpreted as the estimate of the composite indicator.
Thereby the vulnerability is estimated both as a linear combination of the manifest
variables and as composed of sub-indices.

Moreover, we use consistent Partial Least Squares (PLSc) [1, 2] as estimation
method, since the most recent literature suggests its adoption when some constructs
are defined as factor models (as is the case for the Resilience construct in our
model).

The path model in Fig. 1 reproduces the conceptual structures of the two
formulations of the index (EVI and Extended-EVI) in its basic specification. We
specified this model according to the blocks’ internal consistency. In particular we
set Exposure and Shock, which have low internal correlation, as formative blocks and we
add Resilience, which is unidimensional, as a reflective construct. In the structural
part, the two (or three in the extended model) sub-indices are exogenous with respect to
Vulnerability, as conceptually Exposure and Shock (and Resilience in the extended
model) can be seen as determinants of the global index.


4 Results 


Results of this analysis consist of the estimation of 19 models, one for each year.

PLS-SEM assessment is based on measures indicating the model’s predictive
capability, mostly in terms of reliability and validity of the construct measures.
Assessment of measurement models includes for each reflective block an internal
consistency reliability index, measured by Cronbach’s alpha and assessing
the unidimensionality (a necessary condition to define a block as reflective), and a
convergent validity index, measured by the average variance extracted (AVE).

For the block Resilience these indices range in the intervals [0.73; 0.79] and
[0.41; 0.45], respectively, denoting a quite acceptable block structure.
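
For reference, the two indices can be computed as in the following Python sketch. It is an illustration under stated assumptions rather than the software output used here: Cronbach's alpha follows its usual variance-based formula, the AVE is taken as the mean squared correlation between each indicator and the construct score (i.e., the mean squared standardised loading), and the function names and toy data are hypothetical.

```python
import numpy as np

def cronbach_alpha(X):
    """Cronbach's alpha of a block X (rows = observations, columns = indicators)."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    item_variances = X.var(axis=0, ddof=1).sum()      # sum of the indicator variances
    total_variance = X.sum(axis=1).var(ddof=1)        # variance of the row totals
    return k / (k - 1) * (1 - item_variances / total_variance)

def average_variance_extracted(X, scores):
    """Mean squared correlation between each indicator and the construct score."""
    X = np.asarray(X, dtype=float)
    loadings = np.array([np.corrcoef(X[:, j], scores)[0, 1] for j in range(X.shape[1])])
    return np.mean(loadings ** 2)

# Toy block: three noisy indicators of the same underlying score.
rng = np.random.default_rng(2)
latent = rng.normal(size=100)
block = latent[:, None] + rng.normal(scale=0.8, size=(100, 3))
print(cronbach_alpha(block), average_variance_extracted(block, latent))
```

Both indices reach their upper bound of 1 when all indicators in the block are identical.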

A possible criterion to evaluate a formative block is the relevance of the
indicators, to verify if they truly contribute to forming the construct. This is
assessed by testing if the outer weights, i.e., the relative contributions of indicators,
significantly differ from zero. Since the significance of the outer weights can be affected
by external factors, like the number of indicators in the block, it is convenient
to consider also outer loadings, i.e., the absolute contribution of the indicator.
In the estimated models the outer weights are not significant, according to their
bootstrap confidence intervals, in some cases, but most of the time they are associated with quite high
and/or significant values of the outer loadings. Even if this condition is not always
met (both across the 19 models and for all indicators), at an exploratory level we
consider the model to be generally acceptable.

In order to evaluate separately the relevance of the extension of the EVI, and the 
adoption of a multivariate approach, we estimate two different models: 


• the SEM-EVI: PLS-SEM estimation using the 8 base EVI indicators as manifest
variables, related to 2 exogenous latent variables (Exposure and Shock) explaining
an endogenous super-block (Vulnerability);

• the New-EVI: PLS-SEM estimation using the 21 indicators (8 base + the
additional 13) as manifest variables, related to 3 exogenous latent variables
(Exposure, Shock, and Resilience) explaining the endogenous super-block (Vulnerability).


We have also computed a fourth index, named Extended-EVI, as the average of
our 21 indicators. Results obtained using different models/estimation methods are
compared with each other over time at an empirical level. Table 3 shows the four
indices, according to model specification and estimation method.

In the following we will start by comparing the indices on the first row, to consider
the consequences of just adding new variables to the base EVI, without changing
the estimation methods, and we will later compare the indices by column, to
consider the consequences of using the SEM approach.

Table 3 The compared indices

Estimation method | Model specification: 8 variables | Model specification: 8+13 variables
(Weighted) average | EVI | Extended-EVI
PLS-SEM | SEM-EVI | New-EVI

We will point out how the indices perform in terms of (1) the coherence of the
results with the underlying theoretical model and (2) the capability of the estimated
indices to explain real GDP growth, which we choose as a simple aggregate measure
of economic performance.

In summary, in showing our results we will refer to the following scheme:


1. To evaluate the internal coherence of the indices we mainly use some descriptive
analysis of (1) signs and values of the estimated weights and their trend over time;
(2) trend, correlations between and autocorrelations of indices; (3) countries’
final rankings;

2. We next test the predictive power of the indices by regressing real GDP
growth on them.


4.1 Comparing Different Models’ Specifications 


We compare EVI with Extended-EVI to observe how results change by simply
adding the new 13 variables (9 for resilience plus 4 more for exposure) to the
classical index. Correlation between the original measure and the enlarged measure
is reasonably high (0.63).

Country-by-country comparison of the two indices shows similar trends over time
for most countries, albeit with exceptions. Comparing rankings obtained by the
two indices, we notice that the countries with the largest change in their positions,
such as Tonga or St. Kitts and Nevis, are those, as expected, characterized by the
highest value for our resilience indicators.


4.2 Comparing Estimation Methods


As mentioned above, the main expected consequence of using SEM is to let the 
index weighting system emerge from the data. In most cases the outer weights 
of manifest variables on both SEM-EVI and New-EVI show basic instability. 
Moreover, the presence of some negative weights for some variables points out that 
the use of positive (and constant in time) weights in the classical EVI is not justified 
by the data. On the other hand, the assumption of positive weights does not rely on 
positive correlations between variables. 
Table 4 Regression on GDP

Dependent variable: growth in real GDP per capita

Growth(t-1) | 0.159^a | 0.148 | 0.160^a | 0.156^a
EVI | −0.020 | | |
Extended-EVI | | −0.162* | |
SEM-EVI | | | 0.053 |
New-EVI | | | | −0.047*
Intercept | 2.73^a | 7.856 | 0.144 | 3.8098
N | 1706 | 1706 | 1706 | 1706
Adj R² | 0.37 | 0.38 | 0.37 | 0.37

Coefficients identified by (a) are significant at 1%. Coefficients
identified by (*) are significant at 5%


The correlation between the official EVI and the computed SEM-EVI indices
is generally low, ranging in [−0.12; 0.22], and non-significant, while the two
indices individually show a high autocorrelation from one year to the next, ranging in
[0.97; 0.99] (all significant values) for SEM-EVI and in [0.72; 0.99] for New-EVI
(with the exception of year 2010, which is negatively correlated with all other years,
with values in [−0.97; −0.77]). For both, autocorrelation decreases with distance in
time, supporting their strong internal coherence.

In other words, the two indices provide two different measures of vulnerability,
each with its own ranking, but both have their own internal coherence. Extended-EVI
is instead highly correlated with New-EVI, both for the whole sample and for
many countries.⁴


4.3 Does Vulnerability Help Explain Growth? 


In addition to the comparisons among indices, we have tested which of the
proposed models performs better in explaining growth in real GDP per capita, where
vulnerability should have a negative impact on growth.⁵ In Table 4 we report the
results of four fixed-effects panel regressions on GDP growth for each of the four
indices, adding lagged GDP growth as an additional explanatory variable.
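
A fixed-effects regression of this kind can be sketched in Python through the within transformation, i.e., by demeaning every variable country by country before applying least squares. The sketch below is not the estimation code behind Table 4: the function name, the simulated panel and the coefficient values are hypothetical, and the construction of the lagged growth regressor is left out.

```python
import numpy as np

def within_estimator(y, X, groups):
    """Fixed-effects (within) estimates: OLS on group-demeaned data."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    y_dm, X_dm = y.copy(), X.copy()
    for g in np.unique(groups):
        mask = groups == g
        y_dm[mask] -= y[mask].mean()                  # remove the country mean of y
        X_dm[mask] -= X[mask].mean(axis=0)            # remove the country means of the regressors
    beta, *_ = np.linalg.lstsq(X_dm, y_dm, rcond=None)
    return beta

# Simulated panel: 5 countries observed for 6 years, two regressors
# (standing in for lagged growth and a vulnerability index).
rng = np.random.default_rng(3)
country = np.repeat(np.arange(5), 6)
regressors = rng.normal(size=(30, 2))
country_effect = np.array([1.0, -2.0, 0.5, 3.0, 0.0])[country]
growth = country_effect + regressors @ np.array([0.5, -0.2]) + rng.normal(scale=0.1, size=30)
print(within_estimator(growth, regressors, country))
```

Demeaning removes the country-specific intercepts, so only the within-country variation of the index contributes to its estimated coefficient.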

We notice that the SEM-EVI has the “wrong” sign, while the two extended
measures of vulnerability have the best performance, in terms of overall explanatory
power. We control for five outliers: three associated with exceptional positive growth
rates in GDP, and two related to large drops in GDP caused by war episodes (Libya,
2011; Central African Republic, 2013).


⁴ More detailed results are available at http://gennaro.zezza.it/files/abz. 


⁵ We are aware that this analysis cannot rule out the possibility that GDP growth has an impact on 
vulnerability, and that therefore our explanatory variables may not be weakly exogenous. 


5 Final Remarks 


We have analyzed the measure of economic vulnerability adopted by the United 
Nations, EVI, and proposed to extend it by considering additional indicators that 
take into account the ability of a country to recover from shocks. We have further 
proposed a multivariate approach, based on PLS-SEM, for estimating economic 
vulnerability indices. The proposed enlarged index performs better than the classical 
one in predicting the growth rate in real GDP per capita, which supports the 
usefulness and external coherence of our approach. 

The multivariate approach has also shown that some of the manifest variables 
enter the EVI with low, in some cases negative, weights, casting doubt on the 
appropriateness of the EVI base model. These doubts are reinforced by the weak 
assessment measures obtained when estimating the path models of the basic and 
extended indices. 

As the analysis covers 98 countries over 19 years, we also investigated the possible 
role of time in repeated PLS estimates over the years. We consider the relative 
stability of the scores obtained in repeated PLS estimates to be reassuring in terms of 
model specification, but applying the PLS method to a panel, fully exploiting the 
information contained in the autocorrelation and cross-correlation of variables, 
will require further work. 


Bayesian Inference for a Mixture Model on the Simplex 


Roberto Ascari, Sonia Migliorati, and Andrea Ongaro 


Abstract The Flexible Dirichlet (Ongaro and Migliorati, J. Multivar. Anal. 
114:412–426, 2013) is a distribution for compositional data (i.e., data whose support 
is the simplex), which can fit data better than the classical Dirichlet distribution, 
thanks to its mixture structure and to additional parameters that allow for a more 
flexible modeling of the covariance matrix. This contribution presents two Bayesian 
procedures, both based on Gibbs sampling, to estimate its parameters. 
A simulation study has been conducted to evaluate the performance of 
the proposed estimation algorithms in several parameter configurations. Data are 
generated from a Flexible Dirichlet with D = 3 components and with representative 
parameter configurations. 


Keywords Compositional data · Dirichlet mixture · Bayesian estimation · MCMC 


1 Introduction 


Some kinds of data are defined on mathematical spaces other than classical 
ones such as $\mathbb{R}^D$. For instance, compositional data belong to the D-dimensional simplex, 
defined as: 

$$\mathcal{S}^D = \left\{ x \in \mathbb{R}^D : x_i > 0, \ \sum_{i=1}^{D} x_i = 1 \right\}$$


R. Ascari (✉) · S. Migliorati · A. Ongaro 

Department of Economics, Management and Statistics, University of Milano-Bicocca, 
Milano, Italy 

e-mail: roberto.ascari@unimib.it; sonia.migliorati@unimib.it; andrea.ongaro@unimib.it 


© Springer Nature Switzerland AG 2019 
F. Greselin et al. (eds.), Statistical Learning of Complex Data, 

Studies in Classification, Data Analysis, and Knowledge Organization, 
https://doi.org/10.1007/978-3-030-21140-0_11 




This means that data x are positive vectors subject to a unit-sum constraint (i.e., 
proportions). Compositional data are prevalent in many disciplines (e.g., 
geology, medicine, economics, psychology, and environmetrics); therefore, their 
proper treatment is a relevant issue. The Dirichlet is the best-known distribution 
defined on the simplex. Although it has several convenient mathematical properties, 
in many real applications it does not fit the data well, due to its extreme forms of 
simplicial independence or its stiffness in cluster modeling. 

In order to overcome these drawbacks, a new model for this type of data has 
been proposed [4]: the Flexible Dirichlet (FD). Since this model can be represented 
as a finite mixture with particular Dirichlet components, it is better suited to 
capturing cluster structure in data. An estimation procedure based on the EM 
algorithm already exists in the literature [3]. The aim of this contribution is to 
introduce a new parametrization for the FD distribution and to use it for developing 
a new Bayesian estimation procedure. 

More precisely, in Sect. 2 we present the FD model and show some of its 
properties (i.e., the finite mixture structure and first and second moments). Then, 
in Sect. 3, we propose a first Bayesian procedure in order to estimate the parameters 
and point out its drawbacks. In order to overcome them, in Sect. 4 we propose a 
new parametrization that provides a variation independent parameter space, thus 
allowing us to build an efficient Gibbs sampling algorithm. Finally, in Sect. 5 we 
present a simulation study and show the results in two representative parameter 
configurations. 


2 The Flexible Dirichlet Distribution 


The FD distribution is obtained by normalizing a vector $Y = (Y_1, \dots, Y_D)$ with 
positive dependent elements, where $Y_i = W_i + Z_i U$ $(i = 1, \dots, D)$, 
$W_i \sim \text{Gamma}(\alpha_i, \beta)$ are independent random variables (r.v.), 
$U \sim \text{Gamma}(\tau, \beta)$ is a further independent r.v., and 
$Z = (Z_1, \dots, Z_D) \sim \text{Multinomial}(1, p)$ is a random vector independent of 
the $W_i$'s and $U$. Let $Y^+ = \sum_{i=1}^{D} Y_i$; then the normalized vector 
$X = Y/Y^+$ is distributed as a $FD(\boldsymbol{\alpha}, p, \tau)$ and its density function is: 

$$f_{FD}(x; \boldsymbol{\alpha}, \tau, p) = \frac{\Gamma(\alpha^+ + \tau)}{\prod_{r=1}^{D} \Gamma(\alpha_r)} \left( \prod_{r=1}^{D} x_r^{\alpha_r - 1} \right) \sum_{i=1}^{D} p_i \frac{\Gamma(\alpha_i)}{\Gamma(\alpha_i + \tau)} x_i^{\tau},$$

where $x \in \mathcal{S}^D$, $\alpha^+ = \sum_{i=1}^{D} \alpha_i$, $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_D)$, $\alpha_i > 0$, $\tau > 0$, $0 < p_i < 1$, 
and $\sum_{i=1}^{D} p_i = 1$. Its distribution function can be written as a finite mixture with 
particular Dirichlet components: 

$$FD(x; \boldsymbol{\alpha}, \tau, p) = \sum_{i=1}^{D} p_i \, \mathcal{D}(x; \boldsymbol{\alpha}_i),$$

where $\mathcal{D}(\cdot\,; \cdot)$ denotes the distribution function of a Dirichlet r.v., $\boldsymbol{\alpha}_i = \boldsymbol{\alpha} + \tau e_i$ and 
$e_i$ is the vector whose elements are equal to 0 except for the ith element which is 
equal to 1. We recall the first two moments of this distribution: 


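$$E[X \mid \boldsymbol{\alpha}, \tau, p] = \frac{\boldsymbol{\alpha} + \tau p}{\alpha^+ + \tau} = \frac{\boldsymbol{\alpha}}{\alpha^+} \left( \frac{\alpha^+}{\alpha^+ + \tau} \right) + p \left( \frac{\tau}{\alpha^+ + \tau} \right)$$

$$Var(X_i \mid \boldsymbol{\alpha}, \tau, p) = \frac{E[X_i \mid \boldsymbol{\alpha}, \tau, p] \left( 1 - E[X_i \mid \boldsymbol{\alpha}, \tau, p] \right)}{\alpha^+ + \tau + 1} + \frac{\tau^2 p_i (1 - p_i)}{(\alpha^+ + \tau)(\alpha^+ + \tau + 1)}$$

$$Cov(X_i, X_r \mid \boldsymbol{\alpha}, \tau, p) = -\frac{E[X_i \mid \boldsymbol{\alpha}, \tau, p] \, E[X_r \mid \boldsymbol{\alpha}, \tau, p]}{\alpha^+ + \tau + 1} - \frac{\tau^2 p_i p_r}{(\alpha^+ + \tau)(\alpha^+ + \tau + 1)}$$

$i, r = 1, \dots, D$, $i \neq r$. Thanks to the mixture structure highlighted above, the 
density function of the FD can take on several shapes, including a number $k \le D$ 
of different modes. Moreover, the FD can represent a good model for clustering, 
since it is a “structured” mixture with links among the component parameters, as 
emerges from the definition of $\boldsymbol{\alpha}_i$. Each mixture component defines a cluster, whose 
vector mean is: 

$$\theta_i = \frac{\boldsymbol{\alpha} + \tau e_i}{\alpha^+ + \tau}.$$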


These cluster means admit a clear and simple geometric interpretation, as 
they are convex linear combinations of a common “barycenter” $\boldsymbol{\alpha}/\alpha^+$ and the ith 
simplex vertex $e_i$. Thus, the ith element of $\theta_i$ is higher than the ith element of $\theta_j$, 
for every $j \neq i$. This introduces a very simple and reasonable form of differentiation 
among components, which is able to capture a broad range of cluster dissimilarities. 
The parameter $\tau/(\alpha^+ + \tau)$ measures the distance between each cluster mean $\theta_i$ and the 
common barycenter $\boldsymbol{\alpha}/\alpha^+$ in the direction of $e_i$. Details about the EM-based procedure for 
obtaining the MLEs of the FD’s parameters can be found in [3]. 
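The construction of this section can be sketched in a few lines of Python (standard library only). This is a minimal illustration, not the authors' code: the names `sample_fd` and `fd_mean` are hypothetical, and the common Gamma rate β is set to 1 by assumption, since it cancels when Y is normalized. The sketch draws from the FD by normalizing $Y_i = W_i + Z_i U$ and compares the empirical mean with $(\boldsymbol{\alpha} + \tau p)/(\alpha^+ + \tau)$.

```python
import random

def sample_fd(alpha, p, tau, rng, beta=1.0):
    """Draw one point from FD(alpha, p, tau) by normalizing Y_i = W_i + Z_i * U."""
    D = len(alpha)
    # W_i ~ Gamma(shape=alpha_i, rate=beta); gammavariate expects (shape, scale)
    w = [rng.gammavariate(a, 1.0 / beta) for a in alpha]
    u = rng.gammavariate(tau, 1.0 / beta)
    # Z ~ Multinomial(1, p): exactly one component receives the extra term U
    k = rng.choices(range(D), weights=p)[0]
    y = [w[i] + (u if i == k else 0.0) for i in range(D)]
    total = sum(y)
    return [yi / total for yi in y]

def fd_mean(alpha, p, tau):
    """Theoretical mean vector (alpha + tau * p) / (alpha_plus + tau)."""
    phi = sum(alpha) + tau
    return [(a + tau * pi) / phi for a, pi in zip(alpha, p)]

# Illustrative parameter values (assumed, not taken from the paper's tables)
rng = random.Random(0)
alpha, p, tau = [10.0, 10.0, 10.0], [0.2, 0.3, 0.5], 17.0
draws = [sample_fd(alpha, p, tau, rng) for _ in range(20000)]
emp_mean = [sum(x[i] for x in draws) / len(draws) for i in range(3)]
```

With a large number of draws, `emp_mean` should lie close to `fd_mean(alpha, p, tau)`, and every draw sums to one by construction.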





3 Bayesian Inference via Gibbs Sampling 


First of all note that strong identifiability of the FD [4] ensures that this distribution 
does not show invariance under permutation of the mixture components. Therefore, 
no label switching problems arise in the estimation process. 

In order to implement the Bayesian estimation procedure we need to define the 
likelihood function and the priors. Let $x = (x_1, \dots, x_j, \dots, x_n)$ be a sample of size 
n from $X \sim FD(\boldsymbol{\alpha}, p, \tau)$; then the complete-data likelihood function can be written 
as: 

$$L_{CD}(x, S; \boldsymbol{\alpha}, \tau, p) = \prod_{j=1}^{n} \prod_{i=1}^{D} \left[ p_i \frac{\Gamma(\alpha^+ + \tau) \, \Gamma(\alpha_i)}{\Gamma(\alpha_i + \tau) \prod_{r=1}^{D} \Gamma(\alpha_r)} \, x_{ji}^{\tau} \prod_{r=1}^{D} x_{jr}^{\alpha_r - 1} \right]^{z_{ji}}, \qquad (1)$$

where $z_{ji}$ is equal to 1 if the jth observation has arisen from the ith cluster of the 
mixture (i.e., $S_j = i$) and 0 otherwise [6]. 

As for prior elicitation, we can assume that p and $(\boldsymbol{\alpha}, \tau)$ have independent prior 
distributions. In this way we can choose a Dirichlet$(e_0, \dots, e_0)$ prior for p, where 
$e_0 \in \mathbb{R}^+$. This choice is coherent with the literature, where the Dirichlet with 
equal hyperparameters is the standard prior for the weights of a finite mixture 
model [1]. Another simple choice is to impose independence among $\tau$ and each 
$\alpha_i$ (i.e., $\pi(\boldsymbol{\alpha}, \tau) = \pi(\tau) \prod_{i=1}^{D} \pi(\alpha_i)$) and select a reparametrized exponential prior 
distribution for each element of the r.v. $(\alpha_1, \dots, \alpha_D, \tau)$, which greatly simplifies 
computation of the full conditionals. Thus, we have: 

$$\pi(\boldsymbol{\alpha}, \tau) \propto b^{\tau} \prod_{i=1}^{D} a_i^{\alpha_i}, \qquad (2)$$

where $(a_1, \dots, a_D, b)$ are positive hyperparameters. 

Then, the Gibbs sampling implementation can be devised as follows. Let S 
denote the vector of missing group labels (i.e., $S_j = i$ means that the jth observation 
has arisen from group i). Then, the algorithm is composed of the following steps: 


1. Obtain an initial classification $S^{(0)}$ of data into D groups. Repeat steps 2 and 3 
for $m = 1, \dots, B, \dots, B + N$. 
2. Given $S^{(m-1)}$, sample parameters from their full conditionals: 


• Sample $p^{(m)}$ from $\pi(p \mid S^{(m-1)}, x)$ 
• Sample $(\boldsymbol{\alpha}^{(m)}, \tau^{(m)})$ from $\pi(\boldsymbol{\alpha}, \tau \mid S^{(m-1)}, x)$ 


3. Given the new parameters $(\boldsymbol{\alpha}^{(m)}, \tau^{(m)}, p^{(m)})$, sample a new partition $S^{(m)}$ from 
$\pi(S \mid \boldsymbol{\alpha}^{(m)}, \tau^{(m)}, p^{(m)}, x)$. 


If we choose a Dirichlet prior for p, then $\pi(p \mid S^{(m-1)}, x)$ has a Dirichlet 
distribution with parameters $(e_1, \dots, e_D)$, where $e_i = e_0 + N_i(S^{(m-1)})$ and 
$N_i(S^{(m-1)})$ is the number of data points assigned to group i in partition $S^{(m-1)}$. 
In order to obtain a new data partition $S^{(m)}$, in step 3, we can generate a vector from 
a Multinomial$(1, P_j)$ and assign to $S_j^{(m)}$ the position in which the 1 occurs, where 
$P_j = (p_{j1}, \dots, p_{jD})$ and: 

$$p_{ji} = \Pr\left( S_j = i \mid \boldsymbol{\alpha}^{(m)}, \tau^{(m)}, p^{(m)} \right) = \frac{p_i^{(m)} f_D\left( x_j; \boldsymbol{\alpha}_i^{(m)} \right)}{\sum_{h=1}^{D} p_h^{(m)} f_D\left( x_j; \boldsymbol{\alpha}_h^{(m)} \right)} \qquad (3)$$

$(i = 1, \dots, D)$, where $f_D(x_j; \boldsymbol{\alpha})$ is the density function of a Dirichlet r.v. and 
$\boldsymbol{\alpha}_h = \boldsymbol{\alpha} + \tau e_h$. 
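The two conjugate-type updates of the sampler (the Dirichlet draw for p and the multinomial reallocation of the labels) can be sketched as follows. This is a minimal standard-library illustration under our own naming (`allocation_probs`, `sample_labels`, `sample_weights` are hypothetical helpers); the update of $(\boldsymbol{\alpha}, \tau)$ is deliberately left out because its full conditional is nonstandard.

```python
import math
import random

def dirichlet_logpdf(x, a):
    """Log-density of a Dirichlet(a) evaluated at a point x of the simplex."""
    return (math.lgamma(sum(a)) - sum(math.lgamma(ai) for ai in a)
            + sum((ai - 1.0) * math.log(xi) for ai, xi in zip(a, x)))

def allocation_probs(x, alpha, tau, p):
    """Probabilities of Eq. (3): p_i * f_D(x; alpha + tau * e_i), normalized over i."""
    D = len(alpha)
    logw = []
    for i in range(D):
        a_i = [alpha[r] + (tau if r == i else 0.0) for r in range(D)]
        logw.append(math.log(p[i]) + dirichlet_logpdf(x, a_i))
    peak = max(logw)                      # subtract the maximum for numerical stability
    w = [math.exp(v - peak) for v in logw]
    total = sum(w)
    return [v / total for v in w]

def sample_labels(data, alpha, tau, p, rng):
    """Step 3: draw each label S_j from Multinomial(1, P_j)."""
    D = len(alpha)
    return [rng.choices(range(D), weights=allocation_probs(x, alpha, tau, p))[0]
            for x in data]

def sample_weights(labels, D, e0, rng):
    """Step 2, first bullet: p | S ~ Dirichlet(e0 + N_1(S), ..., e0 + N_D(S))."""
    counts = [labels.count(i) for i in range(D)]
    g = [rng.gammavariate(e0 + n, 1.0) for n in counts]  # Dirichlet via normalized Gammas
    total = sum(g)
    return [v / total for v in g]
```

Working on the log scale in `allocation_probs` avoids underflow when the Dirichlet densities are very small, which is a design choice of this sketch rather than something stated in the text.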

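The main issue in this Gibbs sampler is the generation of values from the full 
conditional $\pi(\boldsymbol{\alpha}, \tau \mid S^{(m-1)}, x)$. One can show that the latter is a distribution 
that is difficult to sample from, whatever prior we choose for $(\boldsymbol{\alpha}, \tau)$. Given the prior (2) 
we can compute the full conditionals: 

$$\pi(\alpha_l \mid \boldsymbol{\alpha}_{(-l)}, \tau, S, x) \propto \left[ \frac{\Gamma(\alpha^+ + \tau)}{\Gamma(\alpha_l)} \right]^{n} \left[ \frac{\Gamma(\alpha_l)}{\Gamma(\alpha_l + \tau)} \right]^{N_l(S)} a_l^{\alpha_l} \prod_{i=1}^{D} \prod_{j: S_j = i} x_{jl}^{\alpha_l}$$

$$\pi(\tau \mid \boldsymbol{\alpha}, S, x) \propto \left[ \Gamma(\alpha^+ + \tau) \right]^{n} \prod_{i=1}^{D} \left[ \Gamma(\alpha_i + \tau) \right]^{-N_i(S)} b^{\tau} \prod_{i=1}^{D} \prod_{j: S_j = i} x_{ji}^{\tau}$$

where $\boldsymbol{\alpha}_{(-l)} = (\alpha_1, \dots, \alpha_{l-1}, \alpha_{l+1}, \dots, \alpha_D)$. 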


Unfortunately, these full conditionals do not correspond to any known distribution, 
so an inverse transformation method (ITM) has been implemented in order to 
obtain exact values from these distributions. This method requires the numerical 
evaluation of D + 1 integrals in order to compute the normalization constants of 
the full conditionals, and one more numerical integration to obtain the distribution 
function of each full conditional. Finally, we have to numerically find the 
percentile associated with a value generated from a uniform distribution on (0, 1). 
This makes the algorithm time-consuming (i.e., slow convergence of the Gibbs 
sampler), as emerged from the simulations we implemented in R [5]. 
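The inverse transformation step can be illustrated with a grid-based approximation. The sketch below is not the authors' R implementation: it assumes that the support can be truncated to a finite interval [lo, hi], replaces the numerical integration by a trapezoidal rule on a fixed grid, and finds the percentile by linear interpolation. The helper names are hypothetical; `log_cond_tau` encodes the full conditional of τ given above, with `sum_log_x` standing for $\sum_j \log x_{j S_j}$.

```python
import bisect
import math
import random

def inverse_transform_sample(log_density, lo, hi, rng, n_grid=2000):
    """Approximate draw from an unnormalized density on [lo, hi] via its numerical CDF."""
    step = (hi - lo) / n_grid
    xs = [lo + k * step for k in range(n_grid + 1)]
    logs = [log_density(x) for x in xs]
    peak = max(logs)
    dens = [math.exp(v - peak) for v in logs]        # rescaled to avoid overflow
    cdf = [0.0]
    for k in range(n_grid):                          # trapezoidal rule, cell by cell
        cdf.append(cdf[-1] + 0.5 * (dens[k] + dens[k + 1]) * step)
    u = rng.random() * cdf[-1]                       # uniform draw times the normalizing constant
    k = min(max(bisect.bisect_left(cdf, u), 1), n_grid)
    cell = cdf[k] - cdf[k - 1]
    frac = (u - cdf[k - 1]) / cell if cell > 0.0 else 0.0
    return xs[k - 1] + frac * step                   # linear interpolation of the percentile

def log_cond_tau(tau, alpha, counts, sum_log_x, n, b):
    """Log of the full conditional of tau, up to an additive constant."""
    a_plus = sum(alpha)
    return (n * math.lgamma(a_plus + tau)
            - sum(c * math.lgamma(a + tau) for a, c in zip(alpha, counts))
            + tau * (math.log(b) + sum_log_x))
```

Each draw requires evaluating the density on the whole grid, which makes the computational burden described above easy to see: the cost is paid D + 1 times per Gibbs iteration.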


4 A New Parametrization 


In order to overcome this drawback we propose the following new parametrization 
for the FD model: 

$$\mu = E[X] = \frac{\boldsymbol{\alpha} + \tau p}{\alpha^+ + \tau}, \qquad p = p, \qquad \phi = \alpha^+ + \tau, \qquad w = \frac{\tau}{\phi}.$$

This parametrization allows for an interesting interpretation of the parameters: p are 
the usual weights of a mixture model, $\mu$ represents the overall mean vector, $\phi$ is 
a precision parameter, and w measures the distance of each cluster mean from the 
common barycenter $\mu$. One can show that $w < \min_j \min\{\mu_j / p_j, 1\}$, so we can define 
a normalized version of w, i.e., $\tilde{w} = w / \min_j \min\{\mu_j / p_j, 1\}$. In this way the parameter space 
is variation independent, so that we can choose independent priors: 

$$\mu \sim \mathcal{D}(e_0, \dots, e_0), \quad \phi \sim \text{Gamma}(g_1, g_2), \quad \tilde{w} \sim \text{Unif}(0, 1), \quad p \sim \mathcal{D}(d_0, \dots, d_0) \qquad (4)$$


with $e_0$, $d_0$, $g_1$, and $g_2$ as positive hyperparameters. This set of priors ensures 
noninformativity, or at least vagueness, in the estimation procedure. Indeed, the 
Dirichlet distribution with equal hyperparameters treats all the components alike. 
Moreover, the Gamma distribution is a common choice for the prior of the precision 
parameter, and, by choosing “small” values for the hyperparameters $g_1$ and $g_2$, 
vague priors are obtained. If we set $g_1 = g_2$ then there is a large prior probability 
on observed values all close to zero or one (as in this case the $\alpha_i$’s in the original 
parametrization would be less than 1). If this is not considered in tune with one’s 
prior opinion, one might choose prior distributions for $\phi$ with higher mean, though 
still keeping a large variance (i.e., $g_1 = k g_2$, with $k \in \{10, 60, 100\}$ and $g_2 \in 
\{0.01, 0.001, 0.0001\}$). We implemented a Gibbs sampling algorithm by means of 
the BUGS software [2] in order to sample from the joint posterior distribution. 
The likelihood function used in this model is the complete-data likelihood function 
given by (1) written in terms of the new parameters. This is coherent with the Gibbs 
sampling structure described at the beginning of Sect. 3. 
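The change of parametrization can be sketched as a pair of conversion functions. The sketch assumes the mapping $\mu = (\boldsymbol{\alpha} + \tau p)/(\alpha^+ + \tau)$, $\phi = \alpha^+ + \tau$, $w = \tau/\phi$ and the normalization $\tilde{w} = w / \min_j \min\{\mu_j/p_j, 1\}$; the function names are hypothetical. Inverting the mapping gives $\tau = w\phi$ and $\alpha_i = \phi(\mu_i - w p_i)$.

```python
def to_new_param(alpha, tau, p):
    """Map (alpha, tau, p) to (mu, phi, w, w_tilde); p is left unchanged."""
    phi = sum(alpha) + tau
    mu = [(a + tau * pi) / phi for a, pi in zip(alpha, p)]
    w = tau / phi
    bound = min(min(m / pi for m, pi in zip(mu, p)), 1.0)  # upper limit of w
    return mu, phi, w, w / bound

def to_original_param(mu, p, phi, w_tilde):
    """Inverse map: recover (alpha, tau) from (mu, p, phi, w_tilde)."""
    bound = min(min(m / pi for m, pi in zip(mu, p)), 1.0)
    w = w_tilde * bound
    tau = w * phi
    alpha = [phi * (m - w * pi) for m, pi in zip(mu, p)]
    return alpha, tau

# Illustrative values: alpha = (10, 10, 10), tau = 17 with equal weights
mu, phi, w, w_tilde = to_new_param([10.0, 10.0, 10.0], 17.0, [1/3, 1/3, 1/3])
```

Because $\tilde{w}$ lives on (0, 1) whatever the values of $\mu$ and p, the four blocks of parameters can be given independent priors, which is the point of the normalization.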


5 Simulation Study 


In order to evaluate the performance of this Gibbs sampling algorithm, we simulated 
samples from a Flexible Dirichlet with D = 3 for several configurations of 
parameters. Priors as in (4) have been chosen with $e_0 = d_0 = 1$ and $g_1 = g_2 = 
0.0001$. Due to space constraints, we report only the results for two representative 
parameter configurations: one with well separated clusters and one with overlapping 
clusters. The latter is a challenging scenario for every cluster-based approach, due 
to the difficulty of identifying groups of homogeneous observations. In Fig. 1 we can 
see a simulated dataset for each of these scenarios. 

We have generated 200 samples of size 150 for each parameter configuration 
and, for each of them, we initialized an MCMC chain of length 25,000 (B = 10,000 

Fig. 1 Two datasets simulated from FD with: μ = (0.333, 0.333, 0.333), p = (0.333, 0.333, 
0.333), φ = 47, w = 0.362 (left panel) and μ = (0.271, 0.339, 0.390), p = (0.333, 
0.333, 0.333), φ = 58.5, w = 0.116 (right panel) 

Table 1 Simulation results for well separated clusters 


Parameter | (column header lost) | MLE
μ1        | 0.384                | 0.271
μ2        | 0.335                | 0.340
μ3        | 0.331                | 0.389
p1        | 0.334                | 0.337
p2        | 0.337                | 0.344
p3        | 0.399                | 0.319
φ         | 47.824               | 59.332
w         | 0.390                | 0.152


of which were discarded as burn-in) and, to reduce autocorrelation, we set a thinning value 
equal to 10. Graphical tools (i.e., trace plots and mean plots) have been used in order 
to verify the convergence of the chain to the stationary distribution. In Tables 1 and 2 
we have reported the mean of the 200 posterior means, the mean of the 200 posterior 
Standard Deviations (SD), the mean of the Maximum Likelihood Estimates (MLE) 
and of their Standard Errors (SE). 
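As a small worked example of the burn-in and thinning arithmetic, the sketch below assumes that thinning is applied after the burn-in has been discarded (the text does not state the order), so that a chain of length 25,000 with B = 10,000 and thinning 10 leaves 1,500 draws per sample. The helper names are hypothetical.

```python
def retained_draws(chain, burn_in, thin):
    """Drop the first `burn_in` iterations, then keep every `thin`-th draw."""
    return chain[burn_in::thin]

def posterior_mean(chain, burn_in, thin):
    """Posterior mean estimated from the retained draws of a scalar parameter."""
    kept = retained_draws(chain, burn_in, thin)
    return sum(kept) / len(kept)

# Iteration indices of a chain of length 25,000
kept = retained_draws(list(range(25000)), burn_in=10000, thin=10)
```

The values reported in the tables are then averages of such posterior means over the 200 simulated samples.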

From Table 1 it emerges that, when clusters are well separated, our Bayesian 
procedure produces more accurate and less variable estimates than the EM-based 
ones. Nonetheless, if clusters are too close together (see Table 2), neither approach 
provides unbiased estimates of the parameters, as we expected given 
the data structure. In this scenario, however, the classical procedure is preferable: 
the precision parameter φ and w are heavily underestimated with the Bayesian 
approach, while the ML procedure overestimates them only slightly. 

Finally, note that our estimation procedure is robust with respect to the choice 
of the hyperparameters: even with different values of $e_0$, $d_0$, $g_1$, and $g_2$, we have 
obtained results similar to those shown in Tables 1 and 2. Furthermore, it is 
also robust with respect to the choice of the loss function: due to the approximate 
symmetry of each marginal posterior distribution, the posterior means are very close 
to the posterior medians and posterior modes. 

In conclusion, we have introduced a new parametrization that allows us to set up a 
more efficient Gibbs sampling algorithm. This new Bayesian method is very precise 
when data show separated clusters, but it does not work as well as the classical 
estimation procedure when clusters overlap. 


Stochastic Models for the Size Distribution of Italian Firms: A Proposal 


Anna Maria Fiori and Anna Motta 


Abstract What determines the size distribution of business firms? What kind of 
firm dynamics may be underlying observed firm size distributions? Which candidate 
distributions may be used for fitting purposes? We here address these questions from 
a stochastic model perspective. We construct a firm dynamics process that leads to 
a Dagum distribution of firm size at equilibrium. An empirical study shows that the 
proposed model captures the empirical regularities of firm size distributions with 
considerable accuracy. 


Keywords Firm dynamics · Gibrat’s law · Dagum distribution 


1 Introduction 


The size distribution of business firms is one of the oldest and most relevant concepts 
in Industrial Organization studies. Knowledge of this distribution is fundamental 
for a number of reasons. Public policies often target small and mid-size firms 
whose growth is thought to have a beneficial effect in generating new employment 
opportunities. In contrast, the growth of large businesses is challenged by antitrust 
legislation due to questions of market power and unfair competition [2]. Managers 
operating in young (and hence often small) firms are aware that their companies 
must grow at a rapid pace to become productively efficient and survive. At different 
stages of their development, firms face different growth opportunities and must 
decide upon which of these opportunities should be taken up or discarded. 

The main message that has emerged from approximately a century of research 
in firm size distribution (FSD) is that the random element prevails across the 
growth experiences of firms [7]. For this reason, the literature has progressively 
shifted away from theoretical models that view firms as perfectly rational profit- 
maximizing entities. Attention is now focused on stochastic processes that capture 
the empirical regularities of the FSD, thus providing a more realistic description of 
the dynamics of the economic system (see, e.g., [1, 2, 7]). 

In this work we contribute a new stochastic model of firm dynamics that leads 
to a Dagum distribution for the size of business firms operating in a given industry. 
The model builds on a stochastic growth process that was originally introduced in 
the context of income inequality studies [3, 6]. We here propose and empirically test 
an alternative parameterization. This sheds new light upon the connections between 
growth dynamics and the meaning of parameters that appear in the steady-state 
distribution of firm size. 

The rest of the paper is organized as follows. In Sect. 2 we define a general 
stochastic framework for the study of firm growth and we introduce the Dagum 
distribution as a response to the main “stylized facts” about the FSD. In Sect. 3 
we test the Dagum model on a dataset of Italian companies. Our findings and their 
implications are discussed in Sect. 4. 


2 Model 


Denote by X(t) the size of an economic unit (firm) at time t > 0 and by Y(t) 
its natural logarithm. A central mechanism for explaining the dynamics of X(t) is 
multiplicative growth subject to random fluctuations [7]. This mechanism can be 
formulated by a doubly continuous (in time and states) stochastic process: 

$$dY_t = g(y, t)\,dt + v(y, t)\,dB_t \qquad (1)$$

where $g(y, t)$ is the infinitesimal drift coefficient, $v^2(y, t) > 0$ is the infinitesimal 
variance (diffusion coefficient), and B(t) is a standard Brownian motion. Here, the 
drift term reflects the impact of deterministic forces on the instantaneous growth 
rate $dY_t$, while the Brownian motion fluctuations account for stochastic influences 
associated with uncertainty and risk factors [1]. 

The earliest and most influential model of type (1) was introduced in the 1930s 
by Gibrat [10], who viewed Y(t) as the outcome of a large number of small additive 
influences, independent of each other and identically distributed with mean $\mu$ and 
variance $\sigma^2$. These assumptions imply that all firms in a given industry face the same 
distribution of growth rates independent of their size, a property that Gibrat called 
the Law of Proportionate Effect. Nesting Gibrat’s Law into the general stochastic 
framework (1) gives an unrestricted Wiener process: 

$$dY_t = \mu\,dt + \sigma\,dB_t. \qquad (2)$$

This is a Gaussian process with $E[Y(t)] = \mu t$ and $Var[Y(t)] = \sigma^2 t$, both of which 
increase linearly with t (see, e.g., [12]). Hence Gibrat’s Law implies a Lognormal 
distribution for the size of business firms, X(t) = exp[Y(t)], but this distribution is 
only transitional since its variance keeps growing over time. 
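The linear growth of the variance under Gibrat's Law can be checked with a short simulation. The sketch below is an illustration under assumed parameter values (μ = 0.05, σ = 0.3, horizon 10, all hypothetical): it simulates the Wiener process (2) for a cross-section of firms starting at Y(0) = 0 and records the log-size halfway and at the end of the horizon.

```python
import math
import random
import statistics

def simulate_log_size(mu, sigma, t_end, n_steps, rng):
    """Path of dY = mu dt + sigma dB from Y(0) = 0; returns (Y(t_end / 2), Y(t_end))."""
    dt = t_end / n_steps
    y, y_half = 0.0, 0.0
    for k in range(1, n_steps + 1):
        y += mu * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        if k == n_steps // 2:
            y_half = y
    return y_half, y

rng = random.Random(3)
mu, sigma, t_end = 0.05, 0.3, 10.0
paths = [simulate_log_size(mu, sigma, t_end, 20, rng) for _ in range(4000)]
var_half = statistics.variance([path[0] for path in paths])
var_end = statistics.variance([path[1] for path in paths])
mean_end = statistics.fmean([path[1] for path in paths])
```

Up to sampling error, `var_end` should be close to $\sigma^2 t = 0.9$ and about twice `var_half`, which is why the Lognormal implied by (2) never settles into a steady state.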

Kalecki [11] observed that such growth was not characteristic of actual size 
distributions and in 1945 he amended Gibrat’s model by postulating a negative 
correlation between growth rate and size. Based on the idea that large businesses 
face impediments to growth, Kalecki’s model can be formulated as a mean-reverting 
Ornstein-Uhlenbeck process [12] leading to a steady-state distribution for company 
size X that is Lognormal. Here, the Lognormal model emerges as a proper FSD and 
persists in the steady state as a consequence of “impeded growth.” 

The first applications of the Lognormal distribution were carried out by Gibrat 
and Kalecki themselves, and the goodness of fit they obtained for mid-size 
manufacturing establishments (respectively, in France and in the UK) was striking 
[14]. Starting in the 1980s, however, the availability of more complete datasets 
and the rise of computing technologies have gradually revealed the existence of 
a number of statistical regularities, or “stylized facts” that challenge the Gibrat- 
Kalecki model. In particular, 


Fact 1 Smaller firms grow relatively faster than their larger competitors. In 
samples including small businesses, a negative relationship between firm size 
and expected growth has been repeatedly documented for manufacturing 
firms [2]. However, from a certain size onward, firms tend to experience 
constant returns to scale and thus have the same growth chances [14]. 

Fact 2 Fluctuations in growth rates decrease with firm size. Smaller firms tend to 
display a larger growth rate variance, which is plausible if one thinks that 
their basic structure is less diversified than that of big firms [7]. 


To incorporate these facts into the stochastic framework of Eq. (1), we introduce 
a generalized process of firm growth in which the drift and diffusion coefficients are 
explicitly modeled as functions of firm size. This process was originally discovered 
by Fattorini and Lemmi [6] in the context of income inequality studies and is 
reformulated here in a new parameterization. 

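Denote by $f(y) = \lim_{t \to \infty} f(y, t)$ the steady-state density associated with the 
general stochastic process (1). If it exists, f(y) satisfies the Forward Kolmogorov 
Equation: 

$$0 = -\frac{\partial}{\partial y} \left[ g(y) f(y) \right] + \frac{1}{2} \frac{\partial^2}{\partial y^2} \left[ v^2(y) f(y) \right],$$
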

where $g(y) = \lim_{t \to \infty} g(y, t)$ and $v^2(y) = \lim_{t \to \infty} v^2(y, t)$. In accordance with 
Fact 1, we incorporate size dependence in the drift term by: 

$$g(y) = -\frac{a \sigma^2}{2} \left[ 1 - (p - 1) e^{-a(y - \log b)} \right] \qquad (3)$$

where a, b, p, and σ are positive parameters. Based on Fact 2, the infinitesimal 
variance of Y is greatest at the lower end of the size scale. Hence, we specify the 
diffusion coefficient by: 

$$v^2(y) = \sigma^2 \left[ 1 + e^{-a(y - \log b)} \right]. \qquad (4)$$

Fattorini and Lemmi [6] have shown that this process has a steady-state density for 
Y which is a Type I Skew Logistic: 

$$f(y) = \frac{a p \, e^{-a(y - \log b)}}{\left[ 1 + e^{-a(y - \log b)} \right]^{p+1}} \quad \text{for } y \in \mathbb{R}, \qquad (5)$$

with parameters log b (location), a (scale), and p (shape). It immediately follows 
that the equilibrium distribution of firm size, X = exp(Y), is given by a Dagum 
random variable with density: 

$$f(x) = \frac{a p \, x^{ap - 1}}{b^{ap} \left[ 1 + (x/b)^{a} \right]^{p+1}} \quad \text{for } x > 0, \qquad (6)$$

where a and p now play the role of shape parameters, and b is a scale (see, e.g., 
[3] and [13] for a detailed history of this multi-discovered distribution, its properties 
and its many parameterizations). 
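The link between the Type I Skew Logistic for Y and the Dagum for X = exp(Y) can be made concrete with a few helper functions. This is a sketch with hypothetical names; it assumes the Dagum distribution function $F(x) = [1 + (x/b)^{-a}]^{-p}$, which corresponds to a density with shape parameters a and p and scale b, and uses the change of variable $f_Y(y) = e^{y} f_X(e^{y})$.

```python
import math
import random

def dagum_cdf(x, a, b, p):
    """Dagum distribution function [1 + (x / b)^(-a)]^(-p) for x > 0."""
    return (1.0 + (x / b) ** (-a)) ** (-p)

def dagum_pdf(x, a, b, p):
    """Dagum density a p x^(ap - 1) / (b^(ap) [1 + (x / b)^a]^(p + 1))."""
    return a * p * x ** (a * p - 1.0) / (b ** (a * p) * (1.0 + (x / b) ** a) ** (p + 1.0))

def dagum_quantile(u, a, b, p):
    """Inverse of dagum_cdf for 0 < u < 1."""
    return b * (u ** (-1.0 / p) - 1.0) ** (-1.0 / a)

def skew_logistic_pdf(y, a, b, p):
    """Type I Skew Logistic density of Y = log X."""
    z = math.exp(-a * (y - math.log(b)))
    return a * p * z / (1.0 + z) ** (p + 1.0)

def dagum_sample(a, b, p, rng):
    """Draw a firm size by inversion of the distribution function."""
    u = rng.random()
    while u == 0.0:          # avoid the degenerate endpoint
        u = rng.random()
    return dagum_quantile(u, a, b, p)
```

For any x > 0, `skew_logistic_pdf(math.log(x), a, b, p)` equals `x * dagum_pdf(x, a, b, p)`, which is the statement that the two densities describe the same equilibrium on the log and level scales.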

The Dagum density (6) is regularly varying at infinity with tail exponent $-a - 1$ 
[13]. Thus smaller values of a imply that more probability mass is concentrated 
in the upper tail of the FSD. Interestingly, this can be related to the role of a as 
a scale parameter in the diffusion term (4): smaller values of a imply a higher 
instantaneous volatility of growth rates, particularly in the lower end of the size 
range (cf. Fact 2 above). This leads to a higher probability that firms grow to very 
large sizes. 

The shape parameter p affects the skewness of the steady-state density of Y in 
connection with the behavior of the drift term (3). For p = 1, the drift is constantly 
equal to $-a\sigma^2/2$ and the Type I Skew Logistic (5) reduces to a conventional, 
symmetric Logistic density. This situation corresponds to a weaker form of Gibrat’s 
Law in which the limiting mean of the (infinitesimal) growth rate is independent 
of firm size and negative (reflecting a stability condition explained, e.g., in [7]). 
Values of p > 1 imply a bounded, monotonic drift function that approaches $-a\sigma^2/2$ 
from above as y increases. The corresponding density for Y is positively skewed, 
reflecting a tendency of smaller firms to grow on average faster than their larger 
counterparts (cf. Fact 1 above). 
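The steady-state claim can be checked numerically. The sketch below assumes the drift $g(y) = -(a\sigma^2/2)[1 - (p-1)e^{-a(y-\log b)}]$ and the diffusion $v^2(y) = \sigma^2[1 + e^{-a(y-\log b)}]$, and verifies by finite differences that the Type I Skew Logistic density satisfies the zero-flux condition $g f = \tfrac{1}{2}\,\partial_y (v^2 f)$, which is sufficient for the Forward Kolmogorov Equation to hold at equilibrium. Function names and parameter values are illustrative.

```python
import math

def drift(y, a, b, p, sigma):
    """Drift term: -(a sigma^2 / 2) [1 - (p - 1) exp(-a (y - log b))]."""
    return -(a * sigma ** 2 / 2.0) * (1.0 - (p - 1.0) * math.exp(-a * (y - math.log(b))))

def diffusion_sq(y, a, b, sigma):
    """Infinitesimal variance: sigma^2 [1 + exp(-a (y - log b))]."""
    return sigma ** 2 * (1.0 + math.exp(-a * (y - math.log(b))))

def skew_logistic_pdf(y, a, b, p):
    """Type I Skew Logistic density of the log-size Y."""
    z = math.exp(-a * (y - math.log(b)))
    return a * p * z / (1.0 + z) ** (p + 1.0)

def flux_residual(y, a, b, p, sigma, h=1e-5):
    """g(y) f(y) - 0.5 d/dy [v^2(y) f(y)], which should vanish in the steady state."""
    def vf(s):
        return diffusion_sq(s, a, b, sigma) * skew_logistic_pdf(s, a, b, p)
    derivative = (vf(y + h) - vf(y - h)) / (2.0 * h)   # central finite difference
    return drift(y, a, b, p, sigma) * skew_logistic_pdf(y, a, b, p) - 0.5 * derivative
```

Setting p = 1 in `drift` returns the constant $-a\sigma^2/2$ for every y, the weaker form of Gibrat's Law mentioned above.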


Stochastic Models for the Size Distribution of Italian Firms: A Proposal 
3 Results of Empirical Studies 


We tested the Dagum distribution on a 6-year panel of annual observations of 
total assets (size variable) of Italian companies operating in the Information and 
Communication Technologies (ICT) industry.¹ The industry includes NACE Rev. 2 
Divisions 61 (telecommunications), 62 (computer programming, consultancy, and 
related activities), and 63 (information service activities), where NACE Rev. 2 is a 
classification of economic activities in the European Union managed by Eurostat. 

Previous studies of the size distribution of ICT firms in Italy [8] rejected the 
hypothesis of lognormality due to a regularly varying upper tail. It is consequently 
worth investigating whether the Dagum distribution could be a suitable candidate 
for fitting purposes. Our analysis is illustrated for the logarithmic size variable Y 
whose characteristics are easier to visualize. The implications for the absolute size 
variable X = exp(Y) are immediately deduced. 

The 18,476 companies in our dataset have minimum total assets of 1000 Euros 
and were active in every year of the sample period, from 2010 to 2015. The choice of 
concentrating on relatively long-lived firms is consistent with our use of stochastic 
models that focus on the steady state for a closed population of firms. However, as 
argued in [4], such models have also practical relevance for understanding industries 
with entries and exits as long as these cancel out approximately. 

The boxplots in Fig. 1 summarize the basic year-by-year information about 
central tendency, spread and possible outliers in Y. 

The Normal and Type I Skew Logistic distributions were fitted to the log- 
size variable Y by Maximum Likelihood (ML) on a year-by-year basis (Table 1), 
yielding parameter estimates that appear fairly stable over time. Focusing on the 
shape parameter p, we carried out a formal test of the null hypothesis that p = 1 
(symmetry) against the one-sided alternative that p > 1 (positive skewness). 
The test led to a strong rejection of H₀ in all years of the sample period. In 
view of the role played by p in the drift coefficient (3), this finding may be 
interpreted as evidence of an inverse relationship between expected growth and size, 
in consequence of which the distribution of Y deviates significantly from symmetry 
towards a positively skewed shape. This is confirmed by a visual comparison of the 
Normal and Type I Skew Logistic curves with the empirical histogram of log-size 
data (Fig. 2). 

The similarity/discrepancy between the reference models and the empirical 
distribution of Y has been formally tested by two goodness-of-fit statistics: the 
Kolmogorov-Smirnov (KS) and the Anderson-Darling (AD) test (see, e.g., [8]). As 
shown in Table 1, the Normal distribution is rejected at all plausible significance 
levels, whereas the Type I Skew Logistic provides a very accurate description of Y 
in every year of the sample period. In particular, the Type I Skew Logistic fits the 
upper tail of the log-size distribution significantly better, as shown in Fig. 3. 
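The Kolmogorov-Smirnov comparison can be sketched on simulated data. The code below is an illustration only, with hypothetical names and assumed parameter values; it uses the Type I Skew Logistic distribution function $F(y) = [1 + e^{-a(y - \log b)}]^{-p}$ and does not reproduce the ML fitting step or the Anderson-Darling test. Note that critical values for the KS statistic need adjusting when the parameters are estimated from the same data.

```python
import math
import random

def skew_logistic_cdf(y, a, b, p):
    """Type I Skew Logistic distribution function of the log-size Y."""
    return (1.0 + math.exp(-a * (y - math.log(b)))) ** (-p)

def skew_logistic_sample(a, b, p, rng):
    """Draw Y by inverting F(y) = [1 + exp(-a (y - log b))]^(-p)."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    # expm1 keeps u^(-1/p) - 1 strictly positive even when u is very close to 1
    return math.log(b) - math.log(math.expm1(-math.log(u) / p)) / a

def ks_statistic(sample, cdf):
    """Largest distance between the empirical and the reference distribution function."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for k, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (k + 1) / n - f, f - k / n)
    return d

rng = random.Random(4)
a, b, p = 1.5, 100.0, 3.0
ys = [skew_logistic_sample(a, b, p, rng) for _ in range(5000)]
d_true = ks_statistic(ys, lambda y: skew_logistic_cdf(y, a, b, p))
```

Passing a Normal distribution function fitted to `ys` as the second argument of `ks_statistic` gives the analogous distance for the Lognormal hypothesis on firm size.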


¹ Data source: Aida, http://aida.bvdinfo.com/. 


Fig. 1 Year-by-year boxplots of the logarithmic size variable Y for ICT firms. The empirical 
distribution of Y appears fairly stable over time, with a considerable number of points lying outside 
the whisker ends and a systematically longer whisker on top. These features suggest the presence 
of positive skewness and persistent fat tails that are unlikely to be compatible with a Normal 
distribution 


A source of concern about these results is the impact of potential outliers 
represented by firms with total assets close to the minimum. On the one hand, 
small firms are the backbone of the Italian economy and nearly 10% of the companies 
in our dataset have total assets below 33,000 Euros. On the other hand, very small 
firms could have peculiar features (e.g. self-employment) that may lead to biased 
estimates. We carried out a sensitivity analysis by gradually removing fractions of 
very small firms from the dataset, respectively 0.5% (corresponding to firms with 
total assets below 6000 Euros) and 1% (total assets below 10,000 Euros). This led 
to a progressive increase in estimates of the skewness parameter p (Table 1, last two 
columns), suggesting that departures from normality in the log-size distribution are 
more pronounced when very small firms are excluded from the analysis. 


4 Discussion 


Ever since the work of Gibrat in the 1930s, the Lognormal distribution has played a 
central role in studies of firm size distribution (FSD). A simple empirical test of this 
distribution consists in taking the natural logarithm Y of the firm size variable and 
comparing it to a Normal distribution. This comparison frequently reveals a poor 
fit, particularly in the tails (see, e.g., [9] for an interesting study of Gibrat’s Law by 
Italian macro-regions). 


Fig. 2 Histogram of the log-size variable Y for year 2015: comparison with the normal and type I 
skew logistic curves fitted to data by maximum likelihood. The type I skew logistic fits the whole 
range of log-data with remarkable accuracy, capturing the presence of positive skewness and a 
heavy upper tail. The normal overestimates the frequency of small firms and underestimates the 
frequency of medium and large firms 
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Fig. 3 Focus on the upper tail of the log-size variable Y for year 2015: comparison of the empirical 
survival function for the 10% largest firms with the normal and type I skew logistic curves. The type 
I skew logistic always lies inside the 99% confidence bounds for the empirical survival function, 
whereas the normal curve severely underestimates the probability that large businesses occur 


As an alternative to the traditional assumption of normality for Y, we have 
proposed a stochastic model leading to a Type I Skew Logistic curve at equilibrium. 
This implies that company size, X = exp(Y), follows a Dagum distribution, a
three-parameter curve that combines an interior mode with a regularly varying upper tail.
Our empirical study has shown that the Dagum model fits the size distribution of
Italian firms operating in the ICT industry remarkably well.
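
The link between the two curves can be checked numerically. The sketch below assumes the common parameterization of the Dagum distribution function, F(x) = [1 + (x/b)^(-a)]^(-p), and of the type I skew logistic distribution function, G(y) = [1 + exp(-(y - m)/s)]^(-p). The symbols a, b, m, s and the parameter values are illustrative assumptions and may differ from the notation used earlier in the paper.

```python
import math

def dagum_cdf(x, a, b, p):
    """Dagum CDF under the assumed parameterization F(x) = [1 + (x/b)^(-a)]^(-p)."""
    return (1.0 + (x / b) ** (-a)) ** (-p)

def skew_logistic_cdf(y, m, s, p):
    """Type I skew logistic CDF, G(y) = [1 + exp(-(y - m)/s)]^(-p)."""
    return (1.0 + math.exp(-(y - m) / s)) ** (-p)

# Illustrative (made-up) parameter values, not estimates from the paper.
a, b, p = 1.8, 50_000.0, 1.4

# If X is Dagum(a, b, p), then Y = ln X is type I skew logistic with
# location m = ln b, scale s = 1/a, and the same skewness parameter p.
m, s = math.log(b), 1.0 / a

# Largest discrepancy between the two CDFs over a few firm sizes.
max_gap = max(
    abs(dagum_cdf(x, a, b, p) - skew_logistic_cdf(math.log(x), m, s, p))
    for x in (1e3, 1e4, 5e4, 1e6, 1e8)
)
print(max_gap)
```

Under these assumptions the two functions agree up to floating-point error, which is the sense in which a skew logistic log-size corresponds to a Dagum size.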

As recently observed in [7], regular variation in the upper tail of the FSD may be 
interpreted as evidence that a weaker form of Gibrat’s Law of Proportionate Effect 
holds for large businesses. In contrast with Kalecki’s view, these businesses do not 
find impediments to grow in proportion to their size and, consequently, their growth 
opportunities are greater than a Lognormal distribution would predict. At the same 
time, the Dagum process implies that growth rate volatility is higher among smaller 
firms, which are nevertheless characterized by a tendency to grow on average faster 
than their larger competitors. In our study of Italian ICT firms, this tendency has 
been documented by results of statistical tests on the skewness parameter p that 
characterizes the drift of the Dagum process. 

Although preliminary and limited to a specific dataset, our findings suggest 
potential implications for Industrial Organization studies and policy intervention. 
In particular, public policies targeting small and mid-size firms could be useful to 
(partly) offset their growth rate volatility, consolidate their growth patterns, and 
possibly help them maintain permanent employment [2]. This view seems to be
shared also by the European Commission, which has recognized a prominent role 
of small and mid-size firms as drivers of economic growth [5]. On the other hand, 
the absence of impediments to grow (documented by evidence of a Paretian upper 
tail in the FSD) raises some questions as to whether very large businesses should 
be monitored to prevent possible industrial concentration processes [8]. These 
questions deserve further investigation. In particular, a more detailed study is in 
progress to test the Dagum model on different industries and to extend its basic 
formulation with a view to modeling industries with entries and exits of firms.

Modeling Return to Education
in Heterogeneous Populations:
An Application to Italy


Angelo Mazza, Michele Battisti, Salvatore Ingrassia, and Antonio Punzo 


Abstract The Mincer human capital earnings function is a regression model that
relates an individual’s earnings to schooling and experience. It has been used to
explain individual behavior with respect to educational choices and to indicate
productivity in a large number of countries and across many different demographic
groups. However, recent empirical studies have shown that the population
of interest often embeds latent homogeneous subpopulations, with different returns to
education across subpopulations, rendering a single Mincer’s regression inadequate.
Moreover, whatever (concomitant) information is available about the nature of such 
a heterogeneity, it should be incorporated in an appropriate manner. We propose 
a mixture of Mincer’s models with concomitant variables: it provides a flexible 
generalization of the Mincer model, a breakdown of the population into several 
homogeneous subpopulations, and an explanation of the unobserved heterogeneity. 
The proposal is motivated and illustrated via an application to data provided by the 
Bank of Italy’s Survey of Household Income and Wealth in 2012. 


Keywords Mincer’s earnings function - Mixtures of regression models 


1 Introduction 


Earnings functions are used by social scientists to explain individual behavior with 
respect to educational choices and to indicate productivity [38]. They provide an 
indicator of returns to schooling, typically in the form of projected future wages, 
which helps individuals to decide how to invest in their own human capital [39].
These indicators have also been used to relate fertility decisions with opportunity 
costs; since rearing of children is time intensive, an increase in earnings may induce 
a negative substitution effect on the demand for children [35, 49]. 

Introduced by Jacob Mincer, a pioneer of the New Labor Economics, in his 
seminal work Schooling, Experience and Earnings, the “human capital earnings 
function” is arguably the most popular earnings function [34]. It is a single-equation
model that explains the natural logarithm of earnings as a linear function of years 
of education, years of potential labor market experience, and the square of years of 
potential experience; in formula, 


$$\ln(y) = \mu(\boldsymbol{x}; \boldsymbol{\beta}) + \varepsilon = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_2^2 + \varepsilon, \tag{1}$$


where $y$ denotes earnings, $x_1$ and $x_2$ represent years of education and years of
potential labor market experience,¹ while $\varepsilon \sim N(0, \sigma)$, with $\sigma$ being the
(conditional) standard deviation of $\ln(y)$. In (1), $\boldsymbol{x} = (x_1, x_2)'$ and
$\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2, \beta_3)'$.
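
As a rough illustration of Eq. (1), the sketch below evaluates the mean of ln(y) for a hypothetical worker and shows the effect of one additional year of schooling. The function name and the coefficient values are assumptions made for this example only; they are not estimates from the paper.

```python
import math

def mincer_log_earnings(education, experience, beta):
    """Mean of ln(y) under Eq. (1): b0 + b1*x1 + b2*x2 + b3*x2**2."""
    b0, b1, b2, b3 = beta
    return b0 + b1 * education + b2 * experience + b3 * experience ** 2

# Hypothetical coefficients, chosen only for illustration.
beta = (1.5, 0.05, 0.02, -0.0003)

base = mincer_log_earnings(13, 10, beta)
plus_one_year = mincer_log_earnings(14, 10, beta)

# One more year of schooling shifts ln(y) by b1 ...
log_gain = plus_one_year - base
# ... which corresponds to a relative earnings change of exp(b1) - 1,
# close to b1 itself when b1 is small.
relative_gain = math.exp(log_gain) - 1.0
print(round(log_gain, 4), round(relative_gain, 4))
```

Because the exact relative change exp(b1) - 1 is close to b1 for small coefficients, the education coefficient is commonly read directly as a rate of return.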

The Mincer equation owes its popularity to the straightforward interpretation
of the coefficient $\beta_1$ as an approximate rate of return to education [8]. It has been
examined on many datasets, involving a large number of countries and many 
different demographic groups and, as stated by Lemieux [28], it is “one of the most 
widely used models in empirical economics.” Within income inequality studies, it 
has been used to study wage differentials due to gender [50] and for predicting the 
wage that a self-employed worker in a certain sector of the economy would have
received on average as a paid employee in the same sector [2, 3].
The literature on educational mismatches uses different specifications of (1) for 
quantifying the effect of educational mismatch on wages [37]. 

However, recent empirical studies [6, 23] have shown the relevance of unobserved
heterogeneity. That is, the populations of interest often consist of
latent groups with different returns to education and different
socio-demographic profiles, so that regression coefficients (and dispersion
parameters) cannot be assumed to be the same for all observations, making the use of
a single Mincer’s regression inadequate. Heterogeneity and segmentation have been
shown for Italy by Battisti and Cipollone [6, 14]. In fact, the Italian labor market has
been traditionally characterized by rigid institutions, with employment protection 
legislation imposing strict rules and constraints regarding the ability to hire and fire 
workers. Reforms that started at the end of the 1990s have increasingly introduced 
more flexibility, but these new rules mostly apply only to newly hired workers; this 
has led to the formation of a two-tier labor market (see, e.g., [9]). 

Finite mixtures of linear regression models, introduced by Quandt and Ramsey
[48] in the general form of “switching regression,” constitute a reference
framework of analysis when no information about group membership is available
and the modeling aim is to find groups of observations with similar regression
coefficients. Furthermore, whatever (concomitant) socio-demographic information
is available about the nature of such a heterogeneity, it should be incorporated in the
model in an appropriate manner.


¹See [12, 13] for a discussion on the use of polynomial terms for experience.

To deal with these issues, in Sect. 2 we introduce a finite mixture of Mincer’s 
regressions with concomitant variables: the proposal simultaneously provides a 
flexible generalization of the Mincer regression, a breakdown of the population 
into several homogeneous subpopulations, and an explanation of the unobserved 
heterogeneity also based on the considered concomitant variables. The EM algo- 
rithm is used for parameter estimation and BIC is adopted to select the number of 
subpopulations. In Sect. 4, the model is applied to disposable household income, as 
obtained from the Bank of Italy’s Survey of Household Income and Wealth (SHIW) 
in 2012. In addition to illustrating the use of the model, this application demonstrates,
based on the BIC, that a single Mincer’s regression is inadequate for these data.


2 The Model 


Given a d-dimensional vector W of concomitant variables (individual characteris- 
tics), and based on [16], we propose to generalize Eq. (1) via a finite mixture of 
k Mincer’s regressions with concomitant variables; being a mixture, the proposed 
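model can be defined from the conditional density of $\ln(y)$, given $\boldsymbol{x}$ and
$\boldsymbol{w}$, in the following way


$$p\left[\ln(y) \mid \boldsymbol{x}, \boldsymbol{w}; \boldsymbol{\vartheta}\right] = \sum_{j=1}^{k} \pi_j(\boldsymbol{w}; \boldsymbol{\alpha})\, \phi\left[\ln(y) \mid \boldsymbol{x}; \mu(\boldsymbol{x}; \boldsymbol{\beta}_j), \sigma_j\right], \tag{2}$$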


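where $\pi_j(\boldsymbol{w}; \boldsymbol{\alpha})$ are positive weights (depending on the parameters
$\boldsymbol{\alpha}$) summing to one for each $\boldsymbol{w}$, $\phi(\cdot; \mu, \sigma)$ denotes the
density of a Gaussian random variable with mean $\mu$ and standard deviation $\sigma$,
$\mu(\boldsymbol{x}; \boldsymbol{\beta}_j)$ is defined as in (1), and $\boldsymbol{\vartheta}$ contains all
of the parameters of the model. The multinomial logit model


$$\pi_j(\boldsymbol{w}; \boldsymbol{\alpha}) = \frac{\exp\left(\alpha_{j0} + \boldsymbol{\alpha}_{j1}' \boldsymbol{w}\right)}{\sum_{h=1}^{k} \exp\left(\alpha_{h0} + \boldsymbol{\alpha}_{h1}' \boldsymbol{w}\right)} \tag{3}$$


is assumed for the mixture weights in (2), where
$\boldsymbol{\alpha}_{j1} = (\alpha_{j1}, \ldots, \alpha_{jd})'$,
$\boldsymbol{\alpha}_j = (\alpha_{j0}, \boldsymbol{\alpha}_{j1}')' \in \mathbb{R}^{d+1}$ and
$\boldsymbol{\alpha} = (\boldsymbol{\alpha}_1', \ldots, \boldsymbol{\alpha}_k')'$, with
$\boldsymbol{\alpha}_1 = \boldsymbol{0}$ for the sake of identifiability [22, 24].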
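
The multinomial logit weights in (3) can be sketched in a few lines. The function name, the dummy-coded covariate vector, and all parameter values below are hypothetical and serve only to show how the reference component (all coefficients fixed at zero) enters the computation.

```python
import math

def mixture_weights(w, alpha):
    """Multinomial logit weights pi_j(w; alpha) of Eq. (3).

    alpha is a list of k tuples (alpha_j0, alpha_j1, ..., alpha_jd);
    the first tuple is all zeros (reference component).
    """
    scores = [a[0] + sum(c * v for c, v in zip(a[1:], w)) for a in alpha]
    shift = max(scores)  # subtract the maximum for numerical stability
    expo = [math.exp(s - shift) for s in scores]
    total = sum(expo)
    return [e / total for e in expo]

# Hypothetical parameters for k = 3 components and d = 2 dummy covariates.
alpha = [
    (0.0, 0.0, 0.0),    # component 1: fixed at zero for identifiability
    (1.0, -1.2, 0.4),
    (2.0, -0.5, -0.3),
]
weights = mixture_weights((1, 0), alpha)
print(weights, sum(weights))
```

With the first component as reference, the log-ratio of any weight to the first weight equals the linear predictor of that component, which is what makes the coefficients interpretable as changes in logits.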

Model (2) can be used as a powerful device for clustering by assuming that 
each mixture component represents a group underlying the overall population [33]. 
Advantageously, by means of the mixture weights in (3), the concomitant variables 
w can be used to explain the profiles of the different groups. Here, it is important to 
stress that, based on the mixture model (2), we do not specify the groups a priori, 
but we let the data identify homogeneous groups with respect to the relationship
between $\ln(y)$ and $\boldsymbol{x}$. For alternative uses of covariates and concomitant variables
in mixtures of regression models, see [7, 15, 25, 26, 32, 41-44, 46, 51, 52, 56].


3 Maximum Likelihood Estimation: The EM Algorithm


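To find maximum likelihood (ML) estimates for $\boldsymbol{\vartheta}$ in (2), we adopt the EM algorithm
of [17], as implemented by the stepFlexmix() function of the flexmix package
[22] for R. In detail, given a random sample
$(y_1, \boldsymbol{x}_1', \boldsymbol{w}_1')', \ldots, (y_n, \boldsymbol{x}_n', \boldsymbol{w}_n')'$ of
$(Y, \boldsymbol{X}, \boldsymbol{W})$ from model (2), and once $k$ is assigned, the algorithm basically takes
into account the complete-data log-likelihood


$$l_c(\boldsymbol{\vartheta}) = \sum_{i=1}^{n} \sum_{j=1}^{k} z_{ij} \ln\left[\pi_j(\boldsymbol{w}_i; \boldsymbol{\alpha})\right] + \sum_{i=1}^{n} \sum_{j=1}^{k} z_{ij} \ln\left\{\phi\left[\ln(y_i) \mid \boldsymbol{x}_i; \mu(\boldsymbol{x}_i; \boldsymbol{\beta}_j), \sigma_j\right]\right\}, \tag{4}$$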


where $z_{ij} = 1$ if $(y_i, \boldsymbol{x}_i', \boldsymbol{w}_i')'$ comes from component $j$ and
$z_{ij} = 0$ otherwise.
The EM algorithm iterates between two steps, one E-step and one M-step, until
convergence; their schematization, with respect to model (2), is given below (see
[55, pp. 120-124] for further details).


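E-step: Given the current parameter estimates $\boldsymbol{\vartheta}^{(r)}$ of the $r$th iteration, each
$z_{ij}$ is replaced by the estimated posterior probability


$$z_{ij}^{(r)} = \frac{\pi_j\left(\boldsymbol{w}_i; \boldsymbol{\alpha}^{(r)}\right) \phi\left[\ln(y_i) \mid \boldsymbol{x}_i; \mu\left(\boldsymbol{x}_i; \boldsymbol{\beta}_j^{(r)}\right), \sigma_j^{(r)}\right]}{p\left[\ln(y_i) \mid \boldsymbol{x}_i, \boldsymbol{w}_i; \boldsymbol{\vartheta}^{(r)}\right]}. \tag{5}$$


M-step: The obtained values of $z_{ij}^{(r)}$, which are functions of $\boldsymbol{\vartheta}^{(r)}$, are
substituted for $z_{ij}$ in (4), leading to the expected complete-data log-likelihood,
which is maximized with respect to $\boldsymbol{\vartheta}$, subject to the constraints on these
parameters.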
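
The E-step can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the flexmix implementation: the function names, the toy data, and all parameter values are hypothetical, and the M-step (weighted regressions plus a multinomial logit fit) is deliberately left out.

```python
import math

def normal_pdf(z, mean, sd):
    """Density of a Gaussian with the given mean and standard deviation."""
    return math.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def mincer_mean(x, beta):
    """mu(x; beta) as in Eq. (1), with x = (education, experience)."""
    x1, x2 = x
    return beta[0] + beta[1] * x1 + beta[2] * x2 + beta[3] * x2 ** 2

def logit_weights(w, alpha):
    """pi_j(w; alpha) as in Eq. (3); alpha[j] = (alpha_j0, alpha_j1, ..., alpha_jd)."""
    scores = [a[0] + sum(c * v for c, v in zip(a[1:], w)) for a in alpha]
    top = max(scores)
    expo = [math.exp(s - top) for s in scores]
    total = sum(expo)
    return [e / total for e in expo]

def e_step(data, alpha, betas, sigmas):
    """Return posterior probabilities (Eq. 5) and the observed-data log-likelihood."""
    posteriors, loglik = [], 0.0
    for log_y, x, w in data:
        pis = logit_weights(w, alpha)
        joint = [
            pi * normal_pdf(log_y, mincer_mean(x, beta), sd)
            for pi, beta, sd in zip(pis, betas, sigmas)
        ]
        density = sum(joint)  # mixture density of Eq. (2)
        posteriors.append([value / density for value in joint])
        loglik += math.log(density)
    return posteriors, loglik

# Toy data: (ln hourly wage, (education, experience), (female dummy,)).
data = [(2.1, (13, 10), (0,)), (2.6, (18, 5), (0,)), (1.9, (8, 20), (1,))]
alpha = [(0.0, 0.0), (0.5, -1.0)]  # first component is the reference
betas = [(1.5, 0.03, 0.02, -0.0003), (1.2, 0.07, 0.02, -0.0003)]
sigmas = [0.2, 0.3]

z, loglik = e_step(data, alpha, betas, sigmas)
print(z, loglik)
```

Each row of `z` sums to one, and the returned log-likelihood is the quantity that would be monitored for convergence across iterations.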


The EM algorithm described above needs to be initialized. Among the possible 
initialization strategies [4, 27], in the real data analysis of Sect. 4 a random
initialization is repeated ten times from different random positions and the solution 
maximizing the observed-data log-likelihood among these ten runs is selected (see 
also [5, 29, 45, 47]). 

Once the model is fitted, we can estimate the posterior probabilities of
group membership, say $\hat{z}_{ij}$, based on (5). Hence, each individual can be assigned to
one of the $k$ groups via the maximum a posteriori probability operator
$\mathrm{MAP}(\hat{z}_{ij})$, which takes value 1 if $\max_{h=1,\ldots,k}\{\hat{z}_{ih}\}$ occurs at
component $j$ and 0 otherwise.
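
The MAP assignment amounts to taking, for each individual, the component with the largest estimated posterior probability. A minimal sketch with made-up posterior probabilities follows; the function name is hypothetical.

```python
def map_assignment(posteriors):
    """Assign each individual to the component with the largest posterior probability.

    Components are numbered from 1; ties go to the first maximizing component.
    """
    return [max(range(len(row)), key=row.__getitem__) + 1 for row in posteriors]

# Hypothetical posterior probabilities for four individuals and k = 3 groups.
z_hat = [
    [0.10, 0.70, 0.20],
    [0.05, 0.15, 0.80],
    [0.60, 0.30, 0.10],
    [0.02, 0.49, 0.49],
]
groups = map_assignment(z_hat)
print(groups)
```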

Finally, to select the number of mixture components $k$, we adopt the Bayesian
information criterion $\mathrm{BIC} = -2\,l(\hat{\boldsymbol{\vartheta}}) + m \ln n$, where $m$ is the
overall number of parameters in the model and $l(\hat{\boldsymbol{\vartheta}})$ denotes the maximized
observed-data log-likelihood.
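
Model selection by BIC can be sketched as below. The parameter count assumes an unconstrained model with four regression coefficients and one standard deviation per component, plus d + 1 logit coefficients for each non-reference component; the log-likelihood values are invented for illustration and are not the paper's results.

```python
import math

def n_parameters(k, d):
    """Free parameters of model (2): k*(4 betas + 1 sigma) + (k - 1)*(d + 1) alphas."""
    return k * 5 + (k - 1) * (d + 1)

def bic(loglik, k, d, n):
    """BIC = -2*loglik + m*ln(n); lower values are preferred."""
    return -2.0 * loglik + n_parameters(k, d) * math.log(n)

# Made-up maximized log-likelihoods for k = 1, ..., 4.
logliks = {1: -1510.0, 2: -1395.0, 3: -1340.0, 4: -1331.0}
# n is the sample size used in Sect. 4; d = 4 assumes four dummy-coded
# concomitant columns.
n, d = 3141, 4

scores = {k: bic(ll, k, d, n) for k, ll in logliks.items()}
best_k = min(scores, key=scores.get)
print(best_k)
```

In this toy example the gain in log-likelihood from a fourth component does not offset the penalty for its additional parameters, so the criterion stops at three components.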


4 Real Data Analysis 


We use data provided by the Bank of Italy’s Survey of Household Income and
Wealth (SHIW), reporting several socio-economic characteristics of Italian
households for 2012. The SHIW is a biennial survey on the microeconomic behavior
of Italian families, with a sample of approximately 8000 households per wave. It
contains information both on households (family composition) and on individuals.
Moreover, it provides detailed information on several characteristics of the worker,
such as net yearly earnings, average weekly hours of work and number of months
of employment per year,² educational attainment (the highest completed school
degree), job experience, gender, marital status, sector of employment, composition
of his/her family, parents’ background, region of residence, and town size.

We consider adult heads of the family aged 19 or over, full time and part time
employees, working either in the public or in the private sector,⁴ and such that
information about earnings is available; this yields a total of 3141 individuals.

For each head of family we consider the following variables: 


Y (Earnings): In its original formulation, the Mincer equation refers to the
hourly price of labor as the correct measure of a worker’s earnings.⁵
SHIW contains annual earnings net of taxes and social security
contributions, average number of hours worked per week, and
number of months of employment per year. Based on these
quantities, and as used by most empirical studies,⁶ hourly
wages are defined as


$$\frac{\text{yearly net earnings}}{\text{months worked} \times \text{weekly hours worked} \times 4}.$$


²Hourly wages are defined as: yearly net earnings/(months worked × weekly hours worked × 4).

³Standard, not actual, years of formal schooling are recorded. Since students who fail to reach a
standard have to repeat the year, the actual number of years is likely to be underestimated.

⁴We exclude the self-employed because of the low reliability of their declared earnings.

⁵Monthly or annual wages would in addition capture the effect of individuals’ decisions on working
hours. Given the only weak positive correlation between working time and educational attainment,
it is reasonable to assume that the choice of hours worked reflects individual preferences rather
than educational levels.

⁶Notice that the hourly measure of earnings can be affected by measurement errors due to the fact that
we calculate hourly wages as total earnings divided by hours of work; for instance, there might be
part-time workers who work the whole day for 2 weeks a month.

X1 (Education): Education is generally measured by the number of years spent
at school. SHIW does not contain information about this number,
but only on the highest degree attained by the individual.
Following a common approach in the literature [10, 54], we
calculate the educational attainment of the individual by imputing the
number of years required to complete her/his reported level of
educational attainment.⁷ More precisely, we consider that the
(statutory) numbers of years required to obtain a primary and
a junior school certificate are 5 and 8 years, respectively;
for the upper secondary school, the number of years ranges from
11 (vocational or technical school) to 13 (classical or scientific
studies); finally, for tertiary education, we consider 16, 18, and
21 years for the university diploma, the college degree, and the
postgraduate degree, respectively.

X2 (Experience): Many empirical studies use age as a proxy for the (working)
experience of individuals, but this choice can be severely
biased, especially for young cohorts. Other authors use potential
experience, defined as the difference between the current
age and the age at labor market entry, but they ignore
the possibility of unemployment or underemployment, again
a crucial feature for young cohorts. Here we use as a proxy
for experience the number of years for which social security
contributions have been paid on behalf of the worker; these
should reflect the effective years of training on the job and
learning-by-doing activities.

W1 (Gender): gender, a dichotomous variable assuming values “Male” and
“Female.” The former is chosen as reference category.

W2 (Area): area, a nominal variable with values “North,” “Center,” and
“South.” The first is chosen as reference category.

W3 (Citizenship): citizenship, a dichotomous variable assuming values “Italian”
and “Foreign.” The former is chosen as reference category.
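
The construction of the variables can be sketched as follows. The dictionary keys, the field names of the raw record, and the example respondent are hypothetical and do not correspond to actual SHIW variable codes; only the statutory years of schooling and the hourly wage formula are taken from the definitions above.

```python
# Statutory years of schooling by highest attained qualification, as listed above.
# The keys are hypothetical labels, not SHIW codes.
YEARS_OF_SCHOOLING = {
    "primary": 5,
    "junior_secondary": 8,
    "upper_secondary_vocational": 11,
    "upper_secondary_general": 13,
    "university_diploma": 16,
    "college_degree": 18,
    "postgraduate": 21,
}

def hourly_wage(yearly_net_earnings, months_worked, weekly_hours):
    """Yearly net earnings / (months worked * weekly hours worked * 4)."""
    return yearly_net_earnings / (months_worked * weekly_hours * 4)

def build_record(raw):
    """Map one raw respondent (a dict with hypothetical keys) to (Y, X1, X2, W)."""
    return {
        "y": hourly_wage(raw["net_earnings"], raw["months"], raw["weekly_hours"]),
        "x1": YEARS_OF_SCHOOLING[raw["qualification"]],
        "x2": raw["contribution_years"],
        # Dummy coding with Male, North, and Italian as reference categories.
        "w": (
            int(raw["gender"] == "Female"),
            int(raw["area"] == "Center"),
            int(raw["area"] == "South"),
            int(raw["citizenship"] == "Foreign"),
        ),
    }

# A made-up respondent.
respondent = {
    "net_earnings": 18000.0,
    "months": 12,
    "weekly_hours": 36,
    "qualification": "upper_secondary_general",
    "contribution_years": 12,
    "gender": "Female",
    "area": "South",
    "citizenship": "Italian",
}
record = build_record(respondent)
print(record)
```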


Previous studies have shown that native workers are likely to receive higher
returns than immigrants [23], males to receive higher returns than females, and
workers in the North of Italy to receive higher returns than workers in the other
regions [6, 14]. We wanted our model to incorporate this knowledge, so we selected
W1 (Gender), W2 (Area), and W3 (Citizenship) as concomitant variables. Their role
is, by means of the first term in (5), to drive individuals toward the latent subpopulation
to which individuals with similar socio-demographic status belong.

Model (2) is fitted to the data for values of k ranging from 1 to 10; the 
model corresponding to the lowest BIC value has k = 3 components, and the 
corresponding estimated parameters are reported in Table 1. As the dependent 


⁷Standard, not actual, years of formal schooling are recorded. Since students who fail to reach a
standard have to repeat the year, the actual number of years is likely to be underestimated. 




Table 1 Parameter estimates for model (2) with k = 3

Covariates 150716 
β_j1 (Education) 0.02752
0.02161
−0.00012 | −0.00025
oj *0T5445 | 0.27891 | 0.19186 
Concomitant variables 2.58453 
−1.37438 | −0.50159
0.37758 | −0.36252
−1.35943
−1.31669
Relative size of the groups 0.53709 


Bold style highlights regression coefficients significantly different from zero (significance level 
equal to 0.05) 





Table 2 Conditional frequencies of group membership given the categories of the concomitant
variables


Group | Citizenship: Foreign | Group relative size
1 | 0.18812 | 0.05285
2 | 0.00330 | 0.41006
3 | 0.80858 | 0.53709
Total | 1.00000 |





variable is the log of earnings, the estimation results show that, on average, one 
additional year of education increases earnings by around 3.844% for individuals 
in the first group, 6.528% for individuals in the second group, and 2.752% for 
individuals in the third group. This means that only 41% of the workers receive
a good return from investment in education. Note that the average returns found by
Brunello et al. [11] are 4-5%, so that we may see these previous results as a weighted
average of our clusters’ estimates.
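
This weighted-average reading can be checked with a few lines of arithmetic, using the three estimated returns quoted above and the group relative sizes from Table 2 as weights. The variable names are ours.

```python
# Estimated returns to one extra year of education (in %) and group relative
# sizes, both taken from the text and Table 2.
returns = [3.844, 6.528, 2.752]
sizes = [0.05285, 0.41006, 0.53709]

# Size-weighted average return across the three groups.
average_return = sum(r * s for r, s in zip(returns, sizes)) / sum(sizes)
print(round(average_return, 2))
```

The result is about 4.36%, which lies inside the 4-5% range reported in the earlier literature.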

The concomitant variables in the mixture model allow us to characterize the 
profile of the groups. As an example, the negative and significant coefficient
associated with the Female category of the Gender variable for the second group tells
us that being a woman makes it less likely to belong to the second group than to the
first (more precisely, the corresponding logit decreases by 1.37438).
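
On the odds scale, the same coefficient can be read through its exponential. The short computation below uses only the value quoted in the text; the variable names are ours.

```python
import math

alpha_female_group2 = -1.37438  # coefficient quoted in the text (Table 1)

# In a multinomial logit, exp(coefficient) is the multiplicative change in the
# odds of belonging to group 2 rather than group 1 (the reference group).
odds_ratio = math.exp(alpha_female_group2)
print(round(odds_ratio, 3))
```

The odds ratio is roughly 0.25: other concomitant variables being equal, the odds of belonging to the second group rather than the first are about one quarter as large for women as for men.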

Using the MAP operator, we can assign each subject to a group. Table 2 reports 
the conditional frequencies of group membership given each of the categories of 
the concomitant variables, while Table 3 reports the conditional frequencies of the 
categories of the concomitant variables given the groups. 
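
The two tables are linked by Bayes' rule, which offers a simple consistency check. The sketch below recovers the share of men assigned to the second group from the Table 3 entries and the group relative sizes; up to rounding it matches the 50.159% quoted in the text. The variable names are ours.

```python
# Quantities read from Table 3 and from the group relative sizes in Table 2.
p_male_given_group2 = 0.85870   # share of males within group 2 (Table 3)
p_group2 = 0.41006              # relative size of group 2
p_male = 0.70201                # share of males in the sample (Table 3)

# Bayes' rule: P(group 2 | male) = P(male | group 2) * P(group 2) / P(male).
p_group2_given_male = p_male_given_group2 * p_group2 / p_male
print(round(p_group2_given_male, 5))
```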

Table 3 Conditional frequencies of the categories of the concomitant variables given the groups

Group | Gender: Male | Gender: Female | Area: North | Area: Center | Area: South | Citizenship: Italian | Citizenship: Foreign
1 | 0.52410 | 0.47590 | 0.30723 | 0.16867 | 0.52410 | 0.65663 | 0.34337
2 | 0.85870 | 0.14130 | 0.49457 | 0.23137 | 0.27407 | 0.99922 | 0.00078
3 | 0.59988 | 0.40012 | 0.44221 | 0.18020 | 0.37759 | 0.85477 | 0.14523
Sample | 0.70201 | 0.29799 | 0.45654 | 0.20057 | 0.34288 | 0.90353 | 0.09646


The second group, which is the one with the highest return to education, is
characterized by a disproportionately high presence of males; in fact, whereas
50.159% of all men belong to the second group, only 19.444% of women do
(Table 2). On the other hand, women are overrepresented in the third group, the
one with the lowest return to education, to which 72.115% of women belong, against
only 45.896% of men (Table 2). These gender differences are consistent with the
findings of [18, 19]. Apart from discriminatory attitudes and specifically employers’ 
unwillingness to invest in training female workers, gender differences have been 
ascribed to the different labor participation pattern between men and women arising 
from an intermittent female labor supply that may erode job skills [36]. Also, as
pointed out by Mincer [35], the decline in mortality that initiated the demographic
transition increased incentives to invest in human capital, because the rise
in longevity implied a higher profitability of investing in children’s education
[30, 31, 40]. At a later stage of the demographic transition, the costs of raising
children increased with the increase in the cost of time, producing “an apparent
trading of numbers of children for their quality” [35]. Note that mothers traditionally
have a major role in raising children, so the opportunity cost of child care is
greater for more educated women, especially where public daycare services are less
developed, as is the case in Southern Italy [49].

Foreign workers, who make up 9.645% of the sample, are practically absent
from the second group, whereas they account for 14.523% of the third group (Table 3).
Evidence for a lower return on education for foreign immigrants in Italy is found in
[1]. Using data on immigrants in Israel, [20] shows how, once in the new country,
immigrants tend to accept any kind of job, even ones in which they cannot fully
exploit their human capital and skills. Workers in the South of Italy (34.288% of
the sample) appear underrepresented in the second group (27.407%), while they
account for more than half of the first group (52.41%; see Table 3). Returns to
education therefore appear slightly lower in the South of Italy; this is consistent with the
findings of [18].


5 Conclusions 


The Mincer human capital earnings function provides the most popular indicator 
of return to education. However, empirical studies have shown that populations of 
interest are often constituted by latent groups bearing different returns to education
and characterized by different socio-demographic profiles. In this paper, we propose
a new mixture-based approach to make the Mincer earnings function more flexible
with respect to unobserved heterogeneity. The proposed model allows one to estimate
the density of the income distribution, to detect homogeneous subpopulations, and
to analyze the position of individuals with specific characteristics. The method is 
illustrated using data provided by the Bank of Italy’s Survey of Household Income 
and Wealth in 2012. Our empirical results demonstrate that this method can be 
successfully used in practice. 

Note that within specific applications, the Mincer function has been extended 
in different ways, notably to deal with the potential endogeneity of schooling [53], 
non-linear education premiums [21], heterogeneity in human capital within educa- 
tional levels or work experiences [37], and cohort effects [23]. We acknowledge 
the relevance of these issues, although here, to keep our exposition simple, we 
focused on the classical Mincer equation. However, when required by the context
of the application, these extensions can be easily incorporated within the proposed
framework.

Changes in Couples’ Bread-Winning
Patterns and Wife’s Economic Role
in Japan from 1985 to 2015


Miki Nakai 


Abstract The trend towards dual-income families can be detected in recent years 
in many industrialized countries. However, despite the continuing rise in Japanese 
women’s rates of participation in the economy over the period of industrialization 
and beyond, the notion of gendered division of labour has been seen as “normal” 
in Japanese society. The aim of this paper is to examine whether the determinants 
of married women’s labour force participation have changed over the past several 
decades. Based on national sample surveys conducted in Japan in 1985,
1995, 2005, and 2015, we analyse the income provision-role type of dual-income
couples and examine change/stability of the factors that differentiate couples where
the husband provides the majority of the couple’s income from equal providers.
We find changing effects of women’s own human capital on contribution to
household income. On the other hand, the division of labour within households has 
not changed a lot over the past several decades. 


Keywords Gender division of labour - Male breadwinner - Wives’ economic 
dependency 


1 Introduction 


1.1 Background 


A clear division of paid and unpaid work along gender lines is found in every 
country of the world, but the trend towards dual-income families can be detected 
in recent years in many advanced industrial societies. 

However, despite the continuing rise in Japanese women’s participation in the
economy, as in many Western societies, the gender division of labour has been
accepted as “normal” and remains strong. While the number of households with wives
entirely dependent on their spouses’ income has dramatically declined, most women
in dual-income couples still earn much less than their spouses, and households in
which wives earn as much as or more than their husbands have been very few.

As gender inequalities in the division of labour at home are closely related
to gender inequalities in other spheres of life, particularly in the labour market,
understanding what determines the division of labour within Japanese couples is key
to understanding other aspects of gender stratification. Many studies have argued
that women’s economic dependency on men is an important attribute of stratification
systems and an essential force in the maintenance of gender inequality (e.g. [12]).

The aim of the present study is to examine how couples’ bread-winning patterns,
such as male-breadwinner, equal-provider, or female-breadwinner
couples, relate to individual characteristics and how these associations have changed
over time in Japan. Being a dual-income couple does not necessarily liberate
women from their traditional gender role. There might be quite a gap between
households in which the wife’s employment is perceived as secondary to her husband’s and
other households that have more symmetrical roles, or a more balanced sharing
of responsibilities within the marriage. Therefore, the analysis places emphasis on
what differentiates equal-provider couples from male-breadwinner couples among 
dual-income couples. In the following section, we describe several hypotheses 
related to couples’ bread-winning patterns. To examine those hypotheses we 
perform multinomial logistic regression on the survey data collected in Japan. In 
Sect. 2, we describe the data and the variables of interest. In Sect. 3, we present the 
results of our analysis. In Sect. 4, we conclude the paper. 


1.2 Hypotheses


Based on previous studies, our hypotheses are as follows.


Human Resource Hypothesis Women’s improved educational opportunities are 
thought to boost female labour force participation in many countries. Also, it is 
considered that more females of the recent cohorts enter the labour market than those 
of older cohorts due to expanding access to higher education. Therefore, we first
hypothesize that women’s education may have positive effects on women’s share of
household income and therefore on the likelihood of being an equal provider.


However, the effect of a woman’s educational attainment on her employment 
has not been significant in Japan (e.g. [2]). Our previous study also supported the 
notion that women are highly educated but typically barred from making full use of 
their education in economic and political fields up to the present [7, 8]. Having said
that, a woman’s human resources might be positively associated with an increased
likelihood that she is an equal provider relative to a secondary provider, once she
overcomes the first hurdle, that is, resignation due to marriage or childbirth.


Supplement Household Income Hypothesis Secondly, the husband’s socio-economic
status may have a negative effect on a married woman’s likelihood of becoming an equal
provider (e.g. [13]). Married women may be more likely to enter the labour market
when the husband’s income is low in order to supplement household income. 
According to past empirical research, women were employed in paid work for 
economic necessity, and it is not until the 1970s that women started to pursue 
careers [4]. 


However, there has been significant diversity in the impact of husbands’ 
resources on their spouses’ employment since around the end of the twentieth 
century and distinct differences in the impact of husbands’ resources on their 
spouses’ employment behaviour correspond to the welfare state regimes (e.g. 
[1, 3, 10, 14]). For example, in conservative continental European welfare states
such as Italy, France and Germany, the association is negative, as it used to be:
men with high occupational resources suppress their spouses’ participation in
paid work, reflecting the traditional division of labour in couples and the increasing
dependency of married women on their spouses over the life course. In social
democratic welfare states, on the other hand, men’s occupational resources increase
women’s labour market activity (positive association). A positive effect implies that
economic resources at the household level facilitate a woman’s employment, also
because they help in balancing work and family. More and more advanced postindustrial
economies see the positive effects of husband’s occupational resources on their 
partner’s participation rates in recent years. Women married to well-educated 
husbands as well as women with high-income partners are less likely to leave the 
labour market than women with low-resource partners. 


Values Hypothesis Thirdly, we also hypothesize that values and attitudes toward 
the family and gender roles may affect couple’s bread-winning patterns. We 
hypothesize that gender egalitarian attitudes are positively associated with the 
probability of being in an equal-provider couple. For example, given that younger
cohorts are more egalitarian than older cohorts, this may lead to a rise in
equal-provider arrangements among younger couples. Inglehart and Norris [5] argue that the twentieth
century gave rise to profound changes in traditional sex roles, but that the force of 
this “rising tide” has varied among rich and poor societies. They demonstrate that 
richer, postindustrial societies support the idea of gender equality more than agrarian 
and industrial societies and intergenerational differences in values are largest in 
postindustrial societies and relatively minor in agrarian societies, suggesting that 
the former are undergoing intergenerational changes in values. They also argue that 
cohort change in gender-role attitudes in postindustrial societies is unidimensional, 
with newer cohorts consistently more egalitarian than older cohorts. 


We also hypothesize that values related to the household context may influence
the gendered arrangement of work and care in the household. We focus on the degree
of educational homogamy.¹ Whether or not the couple is homogamous seems to
be associated with a patriarchal culture. More patriarchal households, which may
be associated with female hypergamous couples, may prefer traditional marriage
practice: women are expected to fulfil the roles of wife and mother, and men are
expected to be the chief provider. These asymmetric gender relations within the
marriage may influence couples' preferences for bread-winning type, as well as
their relative power within the marriage [11].


Lifestage Restriction Hypothesis The burden of child-rearing poses a formidable
obstacle to women's professional ambitions and leads women to accept a secondary
provider role within the household. Even though gender equality matters in many
societies, most research has found that women have primary responsibility for
household chores, as well as for caring for their children. Our previous research also
showed that the presence of preschool children has a strong negative influence on
wives' labour participation.


2 Data and Methods 


Data for the present study come from four waves of cross-sectional data covering
the past three decades: the 1985, 1995, and 2005 Social Stratification and Social
Mobility (SSM) surveys of Japanese society, and the 2015 Stratification and Social
Psychology (SSP) survey in Japan. All the surveys were conducted with a similar
approach: face-to-face interviews with a special focus on social stratification and
inequality in contemporary Japan. All the surveys selected nationally representative
respondents through multistage sampling. The subjects of these surveys were
men and women, aged between 20 and 69 for the surveys in 1985, 1995, and 2005, 
and between 20 and 64 for the 2015 SSP survey. Data were collected from 1248 
men and 1405 women in 1985, 2490 men and 2867 women in 1995, 2660 men and 
3082 women in 2005, and 1644 men and 1931 women in 2015. The response rates 
were 67.9%, 66.0%, 44.1%, and 43.0% in 1985, 1995, 2005, and 2015, respectively. 

To make data comparable across the four datasets, we limit our analysis to 
the householders and their spouses, where wife’s age is between 25 and 54. The 
available data refer to 9067 respondents (994 in 1985, 3180 in 1995, 2862 in 
2005, and 2031 in 2015). Using multinomial logistic regressions we analyse how 
individual- and household-level characteristics are associated with each of the 
three dual-income types. We estimate the effects of the correlates on the odds of 
being equal-provider or female-breadwinner couples (reference category is male- 
breadwinner couples) in each of the four waves. 
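
For readers less familiar with the model, the multinomial logit specification described above can be written compactly. The notation below ($Y_i$ for the couple type, $x_i$ for the covariate vector, $\alpha_k$ and $\beta_k$ for the category-specific intercept and coefficients) is generic textbook notation assumed here, not the authors' own:

```latex
\log \frac{P(Y_i = k)}{P(Y_i = \text{male-breadwinner})} = \alpha_k + x_i^{\top}\beta_k,
\qquad k \in \{\text{equal-provider},\ \text{female-breadwinner}\}.
```

Under this form, a positive element of $\beta_k$ raises the odds of category $k$ relative to the male-breadwinner reference, and one such set of coefficients is estimated separately for each of the four waves.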


¹Homogamy is defined as a marriage in which the husband and wife have similar levels of
educational attainment. Hypergamy is when the wife is less educated than the husband, and
hypogamy is when the wife is more educated than the husband.
Dependent Variable We focus on within-couple inequality in the household. We
use wives' contribution to household income as an aspect that reflects
within-couple inequality in the household; it can be captured by (a) income
provision-role type and (b) wives' contribution to total household income. In the
present study, we analyse (a) income provision-role type as the dependent variable.


Income provision-role type is measured based on whether a dominant provider 
exists and identifies who she/he may be. We use a five group classification: (1) 
husband sole provider, (2) husband provides majority, (3) equal providers, (4) wife 
provides majority, (5) wife sole provider [9]. Although we first show the distribution 
of five-category household type in Table 1, we restrict the subsequent analysis to 
the couples of the three dual-income groups (2, 3, and 4) to estimate a multinomial 
logistic model for examining the factors associated with the equal-provider couples 
and female-breadwinner couples as opposed to male-breadwinner couples. This
restriction reduces the sample to 4024 respondents (420 in 1985, 1465 in 1995, 1112 in
2005, and 1027 in 2015).


Independent Variables To capture the effects of human resources of women, we 
include wife’s education. Wife’s education is collapsed into four categories: (1) 
less than a high school, (2) high school, (3) 2-year college, and (4) 4-year tertiary 
education or more, where high school is the reference category. Wife’s age is coded 
into six categories: (1) 25-29, (2) 30-34, (3) 35-39, (4) 40-44, (5) 45-49, and (6) 
50-54, where 30-34 years is the reference category. 


Married couples' division of labour may also vary systematically with regard to
household-level characteristics. The household level explanatory variables include 
age and the number of children within a household, husband’s income, and the 
couples’ relative education. The number of children has four categories: (1) no, (2) 
one, (3) two, and (4) three or more children, with ‘no’ as reference category. The 
presence of a preschooler is coded as a binary variable with respect to children’s age 
0-6, with no preschooler as reference category. Husband's income level is measured
by income decile (ten groups) in each survey year. The couples’ relative education- 
level variable measures whether wife has higher or lower education than her spouse 
and has three categories: (1) husband and wife have equal education, (2) hypogamy, 
and (3) hypergamy, where equal educational level is the reference category. 


3 Results 


We first examined the division of paid and unpaid work between spouses within
households. Table 1 shows how bread-winning patterns and average wife's economic
contribution have changed over the past three decades. Wife's contribution
to household income, which is the percentage of income contributed by wives, was
calculated from the respondent's and spouse's annual incomes. Although an
overwhelming majority (70%) of couples were dual-income by the year 2015, most of them
Table 1 Trends in percent distribution of household types of couples and wife's economic
contribution to household income: 1985-2015

                                     1985     1995     2005     2015
Household type
  Husband sole provider             42.8%    42.0%    41.3%    29.0%
  Husband provides majority         46.8%    47.7%    44.7%    51.4%
  Equal providers                    8.9%     8.7%    11.2%    14.1%
  Wife provides majority             1.6%     1.2%     1.9%     5.2%
  Wife sole provider                 0.0%     0.4%     0.9%     0.3%
Wife's economic contribution
  All age                           14.0     15.1     18.6     25.6
  Wife aged between 25-54           15.1     14.9     17.8     23.1


are male-breadwinner couples, which are often considered to be associated with
low gender egalitarian attitudes. The table also shows that equal-provider couples
account for only 14% of couples in Japan even in 2015.

We present the results of multinomial logistic regression in Table 2.²

First, our hypothesis that wife's own education would be positively associated
with the probability of being in an equal-provider couple among dual-income
couples was supported. Since the mid-1990s, among dual-income couples, having a
college education has heightened a wife's likelihood of being an equal provider
rather than a secondary provider, compared to women with a medium level of
education. Although tertiary education has not been positively associated with
women's participation in paid work in Japan, when we focus on dual-income couples,
women who have invested more in their own human capital have, since around the end
of the twentieth century, settled less readily for a secondary provider role than
women who have invested less in their human capital accumulation.
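
To give a sense of magnitude, a multinomial logit coefficient b corresponds to an odds ratio exp(b). The short Python sketch below converts the four-year college coefficients printed in Table 2, assuming that the four estimate columns run 1985, 1995, 2005, 2015 from left to right; the variable names are illustrative only.

```python
import math

# Four-year college coefficients (equal vs. male breadwinner) as printed in Table 2.
# Assumption: the table columns are ordered 1985, 1995, 2005, 2015.
coefficients = {1995: 1.348, 2005: 1.762, 2015: 1.448}

# An odds ratio is the exponential of a logit coefficient.
odds_ratios = {year: math.exp(b) for year, b in coefficients.items()}

for year, ratio in odds_ratios.items():
    print(year, round(ratio, 2))
```

Read this way, a four-year college education multiplies the odds of being in an equal-provider rather than a male-breadwinner couple by roughly four to six in the three most recent waves, relative to the high school reference category.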

Second, the effects of husband's income have been remarkably significant and
remain constant over time. Women who have husbands with low annual income
are more likely to report being in an equal-provider couple, as opposed to a
male-breadwinner couple. This suggests that women's participation in paid work in Japan
is primarily driven by economic necessity, that is, by the aim of maintaining the
level of household income, rather than by the pursuit of careers.

Third, we do not find a consistent significant association between age and the
probability of being in an equal-provider versus a male-breadwinner couple. We
expected that the degree of gender egalitarian values would be reflected in gender
equality in couples' earnings structures. However, this was not supported by our
findings, and it is still not normative for young married women to share equally in providing


²Because our primary interest is to understand what factors differentiate equal-provider from
male-breadwinner couples, the estimated coefficients for belonging to female-breadwinner rather
than male-breadwinner couples are not shown in Table 2.
Table 2 Multinomial logistic regression estimates: 1985-2015

Equal vs. male breadwinner

                                           1985        1995        2005        2015
Age (ref: 30-34)     25-29                0.192      -0.151      -0.826**     0.064
                     35-39                0.027      -0.046      -0.134       0.017
                     40-44                0.516      -0.293      -0.180      -0.009
                     45-49                0.094       0.285      -0.113       0.191
                     50-54                0.324       0.118      -0.048       0.715**
Wife's education     < high school       -1.086***   -0.138      -1.845**    -0.627
                     Two-year college    -0.068       0.181**     0.592**     0.572**
                     Four-year college    0.143       1.348***    1.762***    1.448***
                                         (0.609)     (0.239)     (0.238)     (0.315)
Couple's education   Husband > wife       0.484       0.178      -0.157       0.336
                     Husband < wife       0.493      -0.156      -0.062       0.029
Husband's income                         -0.229***   -0.105***   -0.261***   -0.204*
                                         (0.062)     (0.032)     (0.040)     (0.036)
Number of children   1                   -0.928      -0.761**    -0.475       0.032
(ref: 0)                                             (0.344)     (0.310)     (0.300)
                     2                   -1.152*     -0.781***   -0.905***   -0.288
                                         (0.612)     (0.298)     (0.278)     (0.280)
                     3 or more           -1.164[?]   -1.237[?]   -1.262[?]   -0.304
Preschool children   Yes                 -0.119       0.450[?]    0.467*
Constant                                  0.685      -0.829       0.793      -0.596
Nagelkerke (Pseudo) R²                    0.291       0.309

Standard errors are in parentheses below the estimates. * p<0.10, ** p<0.05, *** p<0.01

income. Younger couples, as well as older couples, may also face challenges in
work-family reconciliation.


Finally, the probability of being in an equal-provider couple decreases with the
number of children (although not significantly in 2015). This suggests that the more
children a couple have, the more likely they are to be in a couple in which the husband
is the primary provider. Interestingly, however, the presence of preschool children
has little or no effect on the bread-winning arrangement, and its sign and
significance are not what our hypothesis predicted. A previous study found that the
presence of preschool children strongly negatively affects wives' labour participation
(e.g. [6]). The somewhat unexpected positive effects found here might suggest
polarization of occupational class-based outcomes among working mothers.


4 Conclusion and Discussion 


We find changing effects of women's own human capital on their contribution to
household income: education is important for women to increase the probability of
having a more equal division of labour and time within the marriage rather than an
unequal role allocation, but it is not until the late 1990s that highly educated women
show a higher probability of belonging to an equal-provider couple rather than a
male-breadwinner couple. However, the division of labour in marriage has not changed a
lot and wives' earnings still help to reduce income inequality across married couple
households.

Analysing differences in values and in the availability of policies from a comparative
perspective in future research could enrich theory and evidence about how the
introduction of policies might affect the employment of married women, especially mothers
of preschool children. Moreover, asymmetry in terms of what percentage of
household income wife and husband provide may be correlated with other aspects of
asymmetric relationships, such as the division of roles in the home and family or the
couple relationship, which we leave to future research.
Weighted Optimization with Thresholding for Complete-Case Analysis


Graziano Vernizzi and Miki Nakai 


Abstract Complete-case analysis, also known as the listwise deletion (LD) method,
is a relatively popular technique to handle datasets with incomplete entries. It is
known to be effective when data are missing completely at random. However, by 
reducing the size of the dataset it can weaken the final statistical analysis. We present 
an optimization algorithm that improves the size of the final dataset after applying 
LD. It is based on a constrained weighted optimization technique to determine the 
maximum number of variables and respondents from the initial dataset that are 
preserved after applying LD. The main feature is that the method allows for selecting 
a specific set of variables (or respondents) that must be kept during the optimization, 
while balancing their relative importance by means of suitable weights. Moreover, 
we provide analytic formulas for the optimal solution that can be easily evaluated
numerically, reducing the computational complexity associated with the use of
off-the-shelf packages for solving similar large constrained optimization problems. We
illustrate the application of our weighted optimization method to some examples 
and real datasets. 


Keywords Missing data - Complete-case analysis - Constrained optimization 


1 Introduction 


Datasets are often plagued by incomplete entries, due to a variety of reasons: 
improper codes, faulty recording, missed questions, to name a few. Most statistical 
analysis procedures cannot incorporate missing data directly, and therefore a 
number of different methods have been developed to eliminate or impute all missing
data before proceeding with any analysis, e.g. complete-case analysis, or multiple 
imputation and full-information maximum likelihood. Among the many options 
available nowadays, and that have been implemented in conventional software 
suites, an effective (albeit drastic) technique to handle incomplete datasets is the 
complete-case analysis, also known as listwise deletion (LD) method. According to 
LD, any observation with at least one missing data entry is removed completely from 
the dataset. It is evident that LD is advantageous only when a small percentage of
the data are excluded this way. However, in several practical situations, a brute-force 
application of LD can deplete the dataset to a point where the number of data entries 
is not sufficient for a meaningful subsequent statistical analysis. In this work, we 
show how one can improve the applicability of LD, by selecting a suitable subset of 
variables from the dataset. By introducing suitable weights, the selection algorithm 
allows for the inclusion of any subset of variables that are considered essential, and 
cannot be eliminated. For the sake of conciseness, we do not summarize here the 
broad literature discussing LD applicability, for which we refer to [1, 7, 11, 14]. We 
only mention that LD is known to work best when data are missing completely at 
random [2, 8, 10, 12]. 


2 Weighted Optimization 


A dataset with $L$ variables and $N$ observations can be represented by a rectangular
matrix $X$ with $N$ rows and $L$ columns, with entries $X_{ij}$ representing the value
of the $i$-th observation for the $j$-th variable. The presence of missing data can be
recorded in a shadow matrix $A$ [5], whose entries are $A_{ij} = 0$ (complete value) or
$A_{ij} = 1$ (missing value). In situations where missing values abound, LD can reduce
the size of the statistical sample dramatically. In such cases, it may be advantageous
to exclude combinations of variables that are particularly plagued by missing values.
However, which rows and columns should one delete in order to maximize the
number of entries that remain after applying LD? We illustrate the problem with
an example: given the shadow matrix $A$ for a dataset with $N = 4$ observations and
$L = 3$ variables, there are different combinations of variables that can be removed
before applying LD:


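$$
A = \begin{pmatrix} 0&0&0\\ 0&0&1\\ 0&1&0\\ 0&0&1 \end{pmatrix}, \qquad
A \to \begin{pmatrix} 0&0&0 \end{pmatrix}, \qquad
A \to \begin{pmatrix} 0\\ 0\\ 0\\ 0 \end{pmatrix}, \qquad
A \to \begin{pmatrix} 0&0\\ 0&0\\ 0&0 \end{pmatrix}.
$$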


There are only three missing entries. However, a direct application of LD 
would delete all rows but the first one, wasting six non-missing values. A different 
possibility is to delete the last two variables, which saves four non-missing entries, 
but also loses five in the process. One can show that the best solution would be to 
remove the third variable only: LD leaves six non-missing entries, i.e. 50% of the 
original dataset. In general, there are combinations of variables (columns) in the
matrix A that are optimal, in the sense that the number of remaining values after LD 
is maximal. 
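
The small example above can be checked by brute force. The Python sketch below counts the entries that survive LD for a chosen set of retained columns and searches all column subsets exhaustively. The function names are hypothetical, and the exhaustive search is only feasible for a handful of variables, which motivates the analytic treatment that follows.

```python
from itertools import combinations

def entries_after_ld(shadow, kept_cols):
    """Count the entries left after listwise deletion restricted to kept_cols.

    shadow[i][j] is 1 when entry (i, j) is missing and 0 otherwise.
    """
    complete_rows = [row for row in shadow if all(row[j] == 0 for j in kept_cols)]
    return len(complete_rows) * len(kept_cols)

def best_column_subset(shadow):
    """Exhaustively search all non-empty column subsets (small L only)."""
    n_cols = len(shadow[0])
    best = (0, ())
    for size in range(1, n_cols + 1):
        for cols in combinations(range(n_cols), size):
            best = max(best, (entries_after_ld(shadow, cols), cols))
    return best

A = [[0, 0, 0],
     [0, 0, 1],
     [0, 1, 0],
     [0, 0, 1]]

print(entries_after_ld(A, (0, 1, 2)))  # direct LD: 3 entries
print(entries_after_ld(A, (0,)))       # drop the last two variables: 4 entries
print(best_column_subset(A))           # (6, (0, 1)): drop the third variable only
```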

The problem can be formulated mathematically. We introduce two column
vectors, $r$ and $c$, whose elements $r_i$ ($i = 1,\dots,N$) and $c_j$ ($j = 1,\dots,L$) are
binary numbers (0 or 1). The values $r_i$ and $c_j$ indicate whether the $i$-th row and
$j$-th column of $X$ are deleted (zero) or not (one). For instance, the optimal solution
in the last example corresponds to the vectors $r = (1; 1; 0; 1)$ and $c = (1; 1; 0)$.
For any given choice of $r$ and $c$, the total number of missing entries $d(r, c)$ is
$d(r,c) = \sum_{i=1}^{N}\sum_{j=1}^{L} r_i c_j A_{ij} = r^T A c$, where $r^T$ indicates the transposition
operation, and all products are understood as matrix products throughout this article.
The total number of missing entries in the dataset is $d_{\mathrm{total}} = 1_N^T A 1_L$, where $1_x$
indicates the all-one column vector with $x$ elements. The goal is to find the binary
vectors $r$ and $c$ that render $d(r, c) = 0$ (which effectively implements LD) and maximize
the number of remaining non-missing entries, i.e. maximize $m(r, c) = r^T (1_N 1_L^T - A) c$.
Such a constrained optimization problem over the field of binary numbers {0, 1} is
a classic example of integer non-linear programming optimization, which is known
to be NP-hard [4]. Much scientific literature transforms the discrete optimization
problem into finding a global optimum over continuous variables [6]. However,
the continuous version of $m(r,c)$ over all real vectors $r, c$ is quadratic but not
convex, and the global optimum is not guaranteed to exist [9].

Moreover, in several practical applications, there may be variables or respondents
that one does not wish to see removed from the analysis. We therefore consider a
different problem, which is to find the vectors $r$ and $c$ that render $d(r,c) = 0$ and
maximize the number of variables and respondents that remain after LD. By associating
a weight $\omega_j \ge 1$ to each variable and a weight $w_i \ge 1$ to each respondent, the problem
we consider is the minimization of the (convex) quadratic weighted functional
$F_0 = (r - 1_N)^T W (r - 1_N) + (c - 1_L)^T \Omega (c - 1_L)$, where $\Omega = \delta_{ij}\omega_j$ and
$W = \delta_{ij} w_i$ are $L \times L$ and $N \times N$ diagonal matrices, respectively.

The two problems are not independent. In general, the first problem maximizes the number
of non-missing entries but does not guarantee the largest possible number
of variables and respondents, not to mention the variables one wishes to keep during
the optimization. The second problem maximizes the number of variables and
respondents, but by doing so it may sacrifice some non-missing entries. For instance,
in the example at the beginning of this section, the total numbers of variables and
respondents preserved by the three cases are 4, 5, and 5, respectively, which differ
from the numbers of non-missing entries 3, 4, and 6, respectively. The
discrepancy can be mitigated in part since the second problem has several local
minimizers in general, and in the Examples section we introduce a thresholding
technique that can be used to select the minimizer with the highest number of
non-missing entries. The main advantage of considering the second problem
is that it is amenable to analytical treatment; in fact, it can be cast into an
unconstrained optimization problem:


$$F(W, \Omega) = (r - 1_N)^T W (r - 1_N) + (c - 1_L)^T \Omega (c - 1_L) + 2\lambda\, d(r, c), \qquad (1)$$
where we introduced the Lagrange multiplier $2\lambda$ for the constraint (the irrelevant
factor 2 keeps the following expressions simple). Since $W$ and $\Omega$ are positive
definite matrices, so is the quadratic form $F_0$. Moreover, the constrained
minimization is on the closed set $d(r, c) = 0$, and a generalization of the Weierstrass
theorem (see for instance [3]) guarantees the existence of a solution, which can be
determined by the Lagrange method. Equation (1) can be minimized by determining
the stationary points of $F$ with respect to $r$, $c$, and $\lambda$:


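$$
\begin{aligned}
F'_r &:\quad W(r - 1_N) + \lambda A c = 0 \\
F'_c &:\quad \Omega(c - 1_L) + \lambda A^T r = 0 \\
F'_\lambda &:\quad r^T A c = 0
\end{aligned}
\qquad (2)
$$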


By solving the first equation with respect to r, the second equation with respect to 
c, and by substituting one into the other, we obtain: 


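$$
\begin{aligned}
r &= \left(I_N - \lambda^2 W^{-1} A \Omega^{-1} A^T\right)^{-1} \left(1_N - \lambda W^{-1} A 1_L\right) \\
c &= \left(I_L - \lambda^2 \Omega^{-1} A^T W^{-1} A\right)^{-1} \left(1_L - \lambda \Omega^{-1} A^T 1_N\right)
\end{aligned}
\qquad (3)
$$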
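
The substitution step behind Eq. (3) can be spelled out as follows. This is a worked restatement of the derivation described in the text, using only the first two stationarity conditions in Eq. (2); the expression for $c$ follows in the same way with the roles of the two conditions exchanged:

```latex
\begin{align*}
r &= 1_N - \lambda W^{-1} A c, \qquad c = 1_L - \lambda \Omega^{-1} A^{T} r, \\
r &= 1_N - \lambda W^{-1} A 1_L + \lambda^{2} W^{-1} A \Omega^{-1} A^{T} r, \\
\left(I_N - \lambda^{2} W^{-1} A \Omega^{-1} A^{T}\right) r &= 1_N - \lambda W^{-1} A 1_L .
\end{align*}
```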


The matrices $W^{-1}$ and $\Omega^{-1}$ always exist since $W$ and $\Omega$ are positive definite. By
inserting Eq. (3) in the third equation in (2), we obtain an equation for $\lambda$:


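$$
\begin{aligned}
&\left(1_N^T - \lambda 1_L^T A^T W^{-1}\right) \left(I_N - \lambda^2 A \Omega^{-1} A^T W^{-1}\right)^{-1} A \;\times \\
&\quad \times \left(I_L - \lambda^2 \Omega^{-1} A^T W^{-1} A\right)^{-1} \left(1_L - \lambda \Omega^{-1} A^T 1_N\right) = 0. \qquad (4)
\end{aligned}
$$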


Such an equation simplifies considerably by using the singular value decomposition
of the matrix $S = W^{-1/2} A \Omega^{-1/2}$, which is $S = U^T \Sigma V$, where $U$ is an $N \times N$
orthogonal matrix (i.e. $U^T U = U U^T = I_N$), $V$ is an $L \times L$ orthogonal matrix (i.e.
$V^T V = V V^T = I_L$), and $\Sigma$ is a rectangular $N \times L$ matrix with diagonal elements
only, $\Sigma_{ij} = \delta_{ij}\sigma_i$. The singular values $\sigma_i$ are the square roots of the $s$
positive eigenvalues of the matrix $S^T S$. By inserting $A = W^{1/2} U^T \Sigma V \Omega^{1/2}$ in Eq. (4),
and using the matrix identity $(I - BCB^{-1})^{-1} = B(I - C)^{-1}B^{-1}$ repeatedly, the last
equation reads:


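$$\left(\rho^T - \lambda \gamma^T \Sigma^T\right) \left(I_N - \lambda^2 \Sigma \Sigma^T\right)^{-1} \Sigma \left(I_L - \lambda^2 \Sigma^T \Sigma\right)^{-1} \left(\gamma - \lambda \Sigma^T \rho\right) = 0, \qquad (5)$$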


where $\rho = U W^{1/2} 1_N$ and $\gamma = V \Omega^{1/2} 1_L$. Due to the particular structure of the
matrix $\Sigma$, Eq. (5) can be written in terms of the singular values only:


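$$\sum_{i=1}^{s} \frac{(\rho_i - \lambda \gamma_i \sigma_i)\, \sigma_i\, (\gamma_i - \lambda \sigma_i \rho_i)}{\left(1 - \lambda^2 \sigma_i^2\right)^2} = 0. \qquad (6)$$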


In general, Eq. (6) reduces to a polynomial equation that does not admit a closed-form
solution for $\lambda$, but it can be solved numerically.
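
As a rough numerical illustration of solving Eq. (6), the Python sketch below evaluates its left-hand side for the small example of Sect. 3 (weightless case, singular values sqrt(2) and 1, with the matching components rho = (sqrt(2), 1) and gamma = (1, 1)) and locates its real roots by bisection. The function names `constraint` and `bisect`, and the hand-picked brackets that avoid the points lambda = 1/sigma_i, are assumptions made for this illustration, not part of the authors' implementation.

```python
import math

def constraint(lam, sigma, rho, gamma):
    """Left-hand side of Eq. (6) for given singular values and projected vectors."""
    total = 0.0
    for s, p, g in zip(sigma, rho, gamma):
        total += (p - lam * g * s) * s * (g - lam * s * p) / (1.0 - (lam * s) ** 2) ** 2
    return total

def bisect(f, lo, hi, tol=1e-10):
    """Plain bisection; assumes f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_lo > 0) == (f_mid > 0):
            lo, f_lo = mid, f_mid   # root lies in the upper half
        else:
            hi = mid                # root lies in the lower half
    return 0.5 * (lo + hi)

sigma = [math.sqrt(2.0), 1.0]
rho = [math.sqrt(2.0), 1.0]
gamma = [1.0, 1.0]
g = lambda lam: constraint(lam, sigma, rho, gamma)

# Brackets chosen by hand so that no lam = 1/sigma_i falls inside them.
root_small = bisect(g, 0.0, 0.70)
root_large = bisect(g, 0.72, 0.99)
print(round(root_small, 2), round(root_large, 2))  # 0.54 0.92
```

The two roots agree with the real solutions quoted for this example in Sect. 3. For larger problems one would need a systematic way to bracket the roots between consecutive values of 1/sigma_i.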
There may be situations when the inverse matrices enclosed by parentheses in
Eq. (3) do not exist, i.e. when the determinants $\det(I_N - \lambda^2 W^{-1} A \Omega^{-1} A^T) = 0$
or $\det(I_L - \lambda^2 \Omega^{-1} A^T W^{-1} A) = 0$. Such determinants are the characteristic
equations for the eigenvalues of $\Sigma\Sigma^T$ and also $\Sigma^T\Sigma$; therefore, in terms of singular
values, that occurs only when $\lambda = 1/\sigma_i$ for some $i$. Equation (6) diverges at those
values, which is an indication that the stationary points for $F(W, \Omega)$ in Eq. (1) are
actually at $\lambda = \infty$. This fact can be implemented numerically by simply plugging
a sufficiently large value for $\lambda$ into Eq. (3). An alternative approach for such cases is
to multiply one of the matrices $W$ or $\Omega$ by a constant factor: we found that even a
small perturbation is sufficient to move the stationary points of Eq. (1) away from
the singular values $\lambda = 1/\sigma_i$, which provides a finite solution for the optimization
problem.

Finally, a word of caution: when $N$ or $L$ are large numbers, the numerical matrix
inversion in Eq. (3) can be a computationally daunting task. In such cases, we can
approximate the inversion by a Neumann series: $(I - \lambda^2 K)^{-1} = \sum_{n=0}^{\infty} \lambda^{2n} K^n$
(geometric series expansion). In addition, in the particular "weightless" limit with
$\Omega = I_L$ and $W = I_N$, Eq. (3) simplifies to:


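$$r = \left(I_N - \lambda^2 A A^T\right)^{-1} \left(1_N - \lambda A 1_L\right), \qquad c = \left(I_L - \lambda^2 A^T A\right)^{-1} \left(1_L - \lambda A^T 1_N\right). \qquad (7)$$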


Equation (6) still holds, with $\sigma_i$ being the singular values of the shadow matrix $A$.
We conclude this section by commenting on the fact that several algorithms are 
nowadays available for solving large constrained quadratic optimization problems 
numerically. Nevertheless, we believe that the analytic formulas Eqs. (3) and (6) 
provide a higher vantage point when optimizing the functional F in Eq. (1), since 
the numerical evaluation of the above analytic formulas is computationally simpler 
than solving the full initial combinatorial optimization problem. 
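
The weightless formulas in Eq. (7), combined with the Neumann series mentioned above, can be sketched in a few lines of plain Python. The helper names (`transpose`, `matvec`, `neumann_solve`, `weightless_rc`) and the fixed number of series terms are assumptions for illustration. The series converges only when lambda squared times the largest eigenvalue of K is below one, which holds for the small example of Sect. 2 at lambda = 0.54.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def neumann_solve(apply_K, b, lam, n_terms=200):
    """Approximate (I - lam^2 K)^(-1) b by the truncated sum of lam^(2n) K^n b."""
    term = list(b)
    total = list(b)
    for _ in range(n_terms):
        term = [lam ** 2 * x for x in apply_K(term)]
        total = [t + x for t, x in zip(total, term)]
    return total

def weightless_rc(A, lam, n_terms=200):
    """Evaluate Eq. (7), i.e. Eq. (3) with W and Omega equal to identity matrices."""
    At = transpose(A)
    ones_n = [1.0] * len(A)
    ones_l = [1.0] * len(At)
    rhs_r = [1.0 - lam * x for x in matvec(A, ones_l)]    # 1_N - lam * A 1_L
    rhs_c = [1.0 - lam * x for x in matvec(At, ones_n)]   # 1_L - lam * A^T 1_N
    r = neumann_solve(lambda v: matvec(A, matvec(At, v)), rhs_r, lam, n_terms)  # K = A A^T
    c = neumann_solve(lambda v: matvec(At, matvec(A, v)), rhs_c, lam, n_terms)  # K = A^T A
    return r, c

A = [[0, 0, 0],
     [0, 0, 1],
     [0, 1, 0],
     [0, 0, 1]]
r, c = weightless_rc(A, 0.54)
print([round(x, 2) for x in r])  # [1.0, 1.1, 0.65, 1.1]
print([round(x, 2) for x in c])  # [1.0, 0.65, -0.19]
```

These real-valued vectors are close to the rounded values reported for the first example in Sect. 3, and they still have to be converted to binary entries before LD can be applied.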


3 Examples 


We illustrate our proposed optimization method with three examples. First, by using
the matrix $A$ from the example at the beginning of Sect. 2, with all weights equal
to 1, the singular value decomposition of $A$ gives two singular values, $\sigma_1 = \sqrt{2}$
and $\sigma_2 = 1$, and the vectors $\rho = (\sqrt{2}; 1; 0; 1)$, $\gamma = (1; 1; 1)$. The polynomial
equation (6) for the constraint is $3 - 2\lambda - 10\lambda^2 + 2\lambda^3 + 8\lambda^4 = 0$. Such an equation
admits two real solutions, $\lambda \approx 0.92$ and $\lambda \approx 0.54$. After evaluating $F(W, \Omega)$ in
Eq. (1) with those roots, we pick the latter value, which gives $r = (1, 1.1, 0.6, 1.1)$
and $c = (1, 0.6, -0.2)$. We tested several methods to represent the real values of
$r, c$ in terms of 0 and 1 entries. Among them we recommend the following
thresholding method. Define $c_j(t) = \theta(c_j - t)$, where $\theta(\cdot)$ is the Heaviside step
function and $t$ the threshold with $\min c_j < t < \max c_j$. LD can be applied by setting
Fig. 1 Thresholding method applied to the example at the beginning of Sect. 2 (left), to the NLSY
dataset (center), and to the SSP2015 dataset (right). The vertical axis $e(t)$ shows the number of
non-missing entries obtained by the optimization method. The function $e(t)$ is maximal for a threshold
value in a region around $t^* = 0.4$ (left), $t^* = 0.0$ (center), or $t^* = 0.9$ (right), respectively (see
dashed lines). If $c_j < t^*$ its value is rounded to 0, otherwise it is rounded to 1


$r(t) = \delta_{0, A c(t)}$, i.e. $r_i(t) = 1$ when row $i$ has no missing entry among the retained
variables and $r_i(t) = 0$ otherwise. The number of non-missing entries that are still
available after applying LD is $e(t) = r(t)^T (1_N 1_L^T - A) c(t)$. A simple plot of $e(t)$ in the
interval $\min(c_j) < t < \max(c_j)$ shows a region with a maximum, whose location determines
optimal threshold values for $t$. Figure 1 (left) shows the plot for the example under
consideration. For instance, by choosing the thresholding value $t = 0.4$, we obtain
$c(t) = \{1; 1; 0\}$ and $r(t) = \{1; 1; 0; 1\}$. That leaves $e(t) = 6$ remaining non-missing
entries, which is the optimal case as discussed at the beginning of Sect. 2.
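
The thresholding step just described can be sketched in Python as follows. The function names `threshold_ld` and `scan_thresholds`, the strict inequality used for the step function, and the evenly spaced grid of thresholds are choices made for this illustration; the real-valued vector is the rounded one quoted above.

```python
def threshold_ld(A, c_real, t):
    """Binarize c at threshold t, then keep the rows that are complete on the kept columns."""
    c_bin = [1 if cj > t else 0 for cj in c_real]   # step function applied to c_j - t
    r_bin = [1 if sum(a * cj for a, cj in zip(row, c_bin)) == 0 else 0 for row in A]
    # Since r^T A c = 0 by construction, the count of surviving entries is a product.
    entries = sum(r_bin) * sum(c_bin)
    return c_bin, r_bin, entries

def scan_thresholds(A, c_real, n_steps=50):
    """Scan thresholds strictly between min(c) and max(c); return the best (t, result)."""
    lo, hi = min(c_real), max(c_real)
    best = None
    for k in range(1, n_steps):
        t = lo + (hi - lo) * k / n_steps
        result = threshold_ld(A, c_real, t)
        if best is None or result[2] > best[1][2]:
            best = (t, result)
    return best

A = [[0, 0, 0],
     [0, 0, 1],
     [0, 1, 0],
     [0, 0, 1]]
c_real = [1.0, 0.6, -0.2]

print(threshold_ld(A, c_real, 0.4))  # ([1, 1, 0], [1, 1, 0, 1], 6)
best_t, best_result = scan_thresholds(A, c_real)
print(best_result[2])                # 6
```

Any threshold between the second and third components of the real-valued vector gives the same binary vectors here, which is why the plot of e(t) shows a plateau rather than a single peak.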

As a second example, we apply our optimization method to the dataset used
in chapter 4 of [2], where 581 children were interviewed in 1990 as part of the
National Longitudinal Survey of Youth (NLSY). As in [2], we consider only the
eight variables "ANTI", "SELF", "POV", "BLACK", "HISPANIC", "DIVORCE",
"GENDER", "MOMWORK" (see [2] for details). In this case, the complementary
matrix $A$ has rank $k = 4$ and gives a polynomial equation for $\lambda$ of order
fourteen. Four solutions are real, and among them only one corresponds to a
minimum of $F$. The corresponding thresholding plot is in Fig. 1 (center). The
thresholding value $t = 0.0$ gives $e(t) = 2382$ remaining non-missing entries,
and $c = (1; 0; 0; 1; 1; 1; 1; 1)$, which means that if we discard the variables
"SELF" and "POV" before applying LD, the maximum number of data entries
is preserved. More precisely, the full dataset has 4038 non-missing entries. If one
applies LD directly (without optimization), only 1800 entries remain. However,
our optimization algorithm finds that after removing the two variables "SELF" and
"POV", LD leaves 2382 entries (32% increase). Interestingly, if one discards "POV"
only (which is in fact the variable with most missing data, corresponding to 431
non-missing entries), LD leaves only 2114 entries (17% increase). To be absolutely
certain, we performed an exhaustive combinatorial search over all $2^8 = 256$ possible
binary vectors $c$, and verified the validity of this result.
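
The figures quoted for the NLSY example can be cross-checked with simple arithmetic. In the Python sketch below, the respondent counts are derived here by dividing the reported entry counts by the number of retained variables; they are not stated in the text, and the variable names are illustrative.

```python
n_vars = 8
direct_entries = 1800        # LD on all eight variables
no_self_pov_entries = 2382   # LD after dropping SELF and POV
no_pov_entries = 2114        # LD after dropping POV only

# Entries left after LD = complete respondents x retained variables
respondents = (
    direct_entries // n_vars,
    no_self_pov_entries // (n_vars - 2),
    no_pov_entries // (n_vars - 1),
)
gain_self_pov = 100 * (no_self_pov_entries / direct_entries - 1)
gain_pov = 100 * (no_pov_entries / direct_entries - 1)

print(respondents)                            # (225, 397, 302)
print(round(gain_self_pov), round(gain_pov))  # 32 17
```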

As a third example, we test the possibility of adding weights: the purpose is to
favor (or penalize) groups of variables that the user wishes to prioritize. For this
example, we use the 2015 Japanese survey on Stratification and Social Psychology
(SSP2015) [13]. The survey collects face-to-face interviews of a randomly selected
sample of the Japanese population represented by men and women aged 20-64.
The dataset was compiled in 2015 from a total of 3575 respondents (1644
men and 1931 women) over 171 variables. However, 89752 entries are missing,
corresponding to 14.7% of the whole dataset. When one applies LD directly on the
whole dataset, all respondents get deleted. On the other hand, the optimized LD
method with no weights (i.e. $W = I_N$, $\Omega = I_L$) produces a constraint equation for
$\lambda$, Eq. (6), with only three real solutions, one of which corresponds to a minimum
for $F$. By using the thresholding technique we find 114 variables that leave 2880
respondents. Such a solution is optimal, in the sense that no other group of variables
leaves more data entries after applying LD. The solution found by the optimization
algorithm in this example corresponds to 53.7% of the initial dataset, which is a
great improvement over the direct LD without optimization. Now, let us suppose
we are interested in a specific subset of variables that we consider particularly
important and that should not be deleted by the optimization algorithm. For instance,
we pick the following variables from SSP2015:


1. variable q1_1: gender
2. variable age: age
3. variable q4: educational level
4. variable q6_1: current employment status
5. variable q31_1: respondent's income
6. variable q31_2: household income


It turns out that only the first four variables appears among the 114 variables that 
optimize LD, while q3/_/ and q31_2 are excluded. The request of keeping g3/_1 
and q3/_2 will lead to a sub-optimal solution necessarily, i.e. less data entries will 
be available for the final analysis. Therefore, we associate weights w; = 10 to the 
above six variables, and weights w; = 1| to all remaining variables. Moreover, we 
give equal weights to all respondents W = Iy. The weighted optimization method 
leads to a solution of Eqs. (3) and (6) that selects 116 variables including also all the 
above six variables. However, LD deletion over all 116 variables leaves only 2637 
respondents this time, corresponding to 50% of the dataset, corresponding to a 3.7% 
loss with respect to the optimization without weights. It is interesting to compare this 
result with the case where only those six variables are considered while all other 
variables are removed from the dataset. In this case, LD leaves 3182 respondents 
over six variables, i.e. a loss of 393 respondents only. Whether this case is better or 
worse than the one from the weighted optimization, depends mostly on the ultimate 
goal of the statistical analysis. There are situations where it may be advantageous 
to restrict the analysis to small subsets of the variables. In such cases, one must be 
careful when comparing results between different groups because LD may lead to 
different groups of respondents for different groups of variables, which is an approach
that has been strongly deprecated in the literature. In other situations, it may be
preferable to work with the largest possible subset of the dataset, which is complete
and can be used in the final statistical analysis. For instance, six variables and 3182
respondents correspond to 3.1% of the whole dataset, which is rather small when compared to
the subset of 116 variables with 2637 respondents (about 50% of the whole dataset).
Nonetheless, the weighted optimization of the LD method is sufficiently flexible to 
adapt to several situations reliably. 
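
The trade-off discussed above, between the number of variables kept and the number of complete respondents left after LD, can be illustrated with a minimal sketch. The data, the variable names (`age`, `edu`, `income`) and the helper functions below are hypothetical and serve only as an illustration; they are not the optimization algorithm of this paper.

```python
def listwise_delete(rows, variables):
    """Keep only the rows with no missing value (None) in the given variables."""
    return [row for row in rows if all(row.get(v) is not None for v in variables)]


def retained_fraction(rows, variables, all_variables):
    """Share of the full data matrix (rows x all variables) kept after LD."""
    kept = listwise_delete(rows, variables)
    return len(kept) * len(variables) / (len(rows) * len(all_variables))


# Synthetic toy data: 4 respondents, 3 variables, None marks a missing answer.
rows = [
    {"age": 34, "edu": 2, "income": None},
    {"age": 51, "edu": 3, "income": 1800},
    {"age": None, "edu": 1, "income": 1200},
    {"age": 47, "edu": None, "income": None},
]
all_vars = ["age", "edu", "income"]

full = listwise_delete(rows, all_vars)          # complete on all three variables
subset = listwise_delete(rows, ["age", "edu"])  # complete on two variables only

print(len(full), retained_fraction(rows, all_vars, all_vars))
print(len(subset), retained_fraction(rows, ["age", "edu"], all_vars))
```

In this toy case, dropping one variable doubles the number of complete respondents, which is the kind of gain the optimization looks for on a much larger scale.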
4 Conclusions


In this work we described how one can optimize the use of LD by maximizing the 
number of variables and respondents that remains after applying LD. The method 
is deterministic, and it provides a quantitative numerical guideline for selecting 
the optimal subset of variables. Moreover, the method also makes it possible to prioritize
groups of variables (and/or respondents) by means of weights in the optimization 
equations. We also perfected a general thresholding method that greatly helps the 
numerical implementation of the optimization algorithm. We tested it on several 
cases, and verified it is sufficiently powerful and robust to provide reliable results. 
Obviously, such a method should complement other heuristic approaches, and 
general considerations about the dataset. Although we do not have sufficient space 
to discuss all details here, we recall that one should always try to identify the
reasons for missingness, in order to give a correct interpretation of the distribution 
of missing data, and to select the best method to analyze the dataset. Moreover, 
LD can be applied effectively without bias when data are missing completely at 
random, and it cannot substitute other methods (such as imputation methods) that 
are more effective when data are missing at random, or not at random (see, e.g. 
[1, 11]). Furthermore, our algorithm does not take into account the quality of the
selected data, and therefore it could lead to inconvenient datasets. Nevertheless, our 
approach is sufficiently flexible to allow the inclusion of information on data quality, 
via the assignment of specific weight factors at the beginning of the optimization 
procedure. 
Part IV
Graphical Models


Measurement Error Correction
by Nonparametric Bayesian Networks:
Application and Evaluation


Daniela Marella, Paola Vicard, Vincenzina Vitale, and Dan Ababei 


Abstract In this paper a procedure for measurement error correction based on 
nonparametric Bayesian networks is proposed. The performance of the proposed 
method is evaluated using a validation sample collected by Banca d'Italia and 
a major Italian bank group to investigate the measurement error mechanism in 
the amounts of the main financial variables observed in the Banca d’Italia survey on
Household Income and Wealth. Specifically, in this paper attention is focused on the 
bond amounts. By means of Uninet’s programmatic engine working directly from 
R, data can be corrected unit by unit by sampling from the nonparametric Bayesian 
network. Thanks to the validation sample, the distances between the true and the 
imputed values are computed and the procedure is evaluated. 
1 Introduction


Observations of variables are often affected by measurement error, so that values
observed in the data collection stage are different from the true ones. Measurement 
errors may lead to large bias effects on the estimation process. As a consequence, 
preliminary measurement error detection and correction should be performed before 
applying standard statistical inference techniques in order to avoid a serious impact 
on the survey results quality. To this aim, the error generating mechanism is modeled 
and estimated; then, microdata imputation is carried out. In this paper we focus 
attention on the respondent measurement error and we analyze and correct it using 
Bayesian networks. 

When the variables are categorical, standard Bayesian networks (BNs, [2]) 
have been proposed as a tool to deal with measurement error. Simulations and 
applications are illustrated in [7] and [8]. For continuous variables a preliminary 
discussion is in [9]. 

In this paper we focus on continuous variables (for example income,
bond and share amounts) whose distribution is not necessarily Gaussian. The
error generating mechanism is estimated using nonparametric Bayesian networks
(NPBNs). In this way, continuous data can be analyzed without any preliminary
discretization and without the unrealistic assumption of a Gaussian distribution.

Moreover, an automatic procedure for measurement error correction based on
sampling from the estimated NPBN is introduced and applied to a validation sample
associated with the Banca d’Italia survey on Household Income and Wealth (SHIW,
for short), whose questionnaire and survey design are very close to those used
in SHIW.

The paper is organized as follows. In Sect. 2 nonparametric Bayesian networks
are briefly introduced. In Sect. 3 an application of nonparametric Bayesian networks 
to the validation sample provided to us by Banca d’Italia is illustrated. Results are 
shown in Sect. 3.1. Final discussion is in Sect. 4. 


2 Nonparametric Bayesian Networks 


BNs are multivariate statistical models satisfying sets of (conditional) independence 
statements contained in a directed acyclic graph (DAG). A DAG is a pair G = 
(V, E) where V is the set of nodes and E is the set of directed edges between pairs 
of nodes. Each node represents a random variable, while missing arrows between 
nodes imply (conditional) independence between the corresponding variables. A 
directed graph is acyclic in the sense that it is forbidden to start from a node and,
following the directions of the arrows, go back to the starting node. Given that BNs are
conditional independence models, they make it possible to describe and to read independencies from
the DAG. There are properties connecting the concept of conditional independence
between variables and absence of an arrow in the graph; these are encoded in the 
Markov properties. For more details, we refer to [6]. 
When the variables of interest are continuous, structural and parameter learning
and evidence propagation can be performed under the assumption of Gaussian
distribution [2]. However, in many real cases the distributions of the variables are so far
from normality that the Gaussian assumption becomes completely unrealistic. In 
such circumstances, continuous variables are generally discretized, and inference 
techniques for discrete BNs are used. 

To avoid inappropriate assumptions and discretization, multivariate nonlinear
complex dependence structures can be modeled by copulas giving rise to NPBNs; 
for details, see [1] and [3]. 

Differently from standard BNs, in continuous NPBNs [5] nodes are associated 
with continuous invertible distribution functions and edges with (conditional) 
rank correlations that are realized by a chosen copula. Here the joint Gaussian 
copula is used. It satisfies the zero independence property: zero rank correlation 
is equivalent to zero partial correlation that, in turn, is equivalent to zero conditional 
correlation. Finally, the latter implies conditional independence and absence of the 
corresponding edge. 

The main advantage of nonparametric networks is that the absence of an arc can
still be interpreted as a (conditional) independence statement (as for standard BNs),
so that the DAG can still be used to represent the conditional independence relations
among a set of continuous variables of interest.


3 NPBNs and Measurement Error: Application and Results


In this paper, an NPBN-based measurement error model for data on bond amounts
in the survey on household income and wealth 2008 is estimated and a procedure
for the detection and correction of measurement error is proposed. A validation
sample, provided to us by Banca d’Italia, has been used to estimate the network and
to evaluate the performance of the correction procedure.

SHIW is a biannual sample survey conducted by Banca d’Italia. Its main objective
is to study the main sources of wealth (such as income, dwelling, investments
in bonds, shares, and other financial products) of Italian households. Data in the 
validation sample have been collected through an independent experiment survey 
done by Banca d’Italia and a major Italian bank group on a sample of the latter. The 
survey was carried out in 2003 on a sample of 1681 households where at least one 
member was a customer of the bank group. Then survey data were matched with the 
bank customers database containing the amount of bonds and shares actually held 
by the statistical units selected in the sample. 

A direct comparison of reported and true bond amounts shows that 87% of
household data are affected by underreporting; therefore, we proceed to correct
the declared bond amounts through a procedure based on NPBN. Since data
quality suffers from a series of inconsistencies, preliminary data cleaning is carried
out to improve data accuracy. In order to avoid inconsistencies ascribable to an
individual having more than one bank account with investments in bonds, all
multi-banked customers (709) have been eliminated. Furthermore, 19 units with
incoherent information are dropped from the dataset. The final sample size is 844.

Table 1 Description of the variables analyzed in the NPBN model

Variable       Variable description
IV_AFNORISK    True amount of bonds
F_AFNORISK     Amount of bonds
F_AFRISK       Amount of stocks
ETA            Age of the head of the household
Y              Income
YM             Self-employment income
YLM            Payroll employment income
AR             Real assets
F_AF1          Certificates of deposit
F_AF2          Italian government securities (BOTs, CCTs, etc.)
F_AF3          Italian bonds and foreign securities
F_AF4          Quoted and not quoted shares
F_AF5          Mutual funds
F_AF6          Asset management
SUPAB          Surface of dwelling
LIE            Propensity to underreport

The list of the studied variables, together with their description, is reported 
in Table 1. All the variables, except the true bond amount owned by households 
(IV_AFNORISK), are given by the values declared by the respondent. Differently
from previous works [10], the variable LIE is continuous; it measures the
propensity to underreport the true amount of bonds. We consider a respondent as 
a liar when the relative difference between the declared and the true bond amounts 
is greater than 10%, distinguishing misreporting in bona fides, attributable to an 
objective difficulty in retrieving the correct information, from the intentional one. 

The analysis has been carried out using the software UniNet,¹ where the joint
Gaussian copula is used and the learning algorithm presented in [4] is implemented.

Notice that, since LIE is a continuous variable, the network structure can be
learned directly from the overall set of data without the need to impose an arrow
from F_AFNORISK to LIE, as done in [10] to overcome the problem of LIE
being a binary variable.

In order to learn the structure, the variables are preliminarily ordered according 
to subject-matter knowledge and time/logical ordering. The NPBN is estimated 
starting from the saturated graph and computing all the rank and partial rank 
correlations associated to the edges. Next, those edges characterized by small rank 
correlations are removed. 


1 www.lighttwist.net/wp/uninet. 
Fig. 1 NPBN for propensity to underreport and bond amount value correction


In our analysis of the validation sample, all edges with associated partial 
correlation larger than or equal to 0.1 are retained. The remaining rank correlations 
vary, in absolute value, from 0.10 to 0.74. The resulting NPBN is shown in Fig. 1. 
The network shows that the propensity to underreport, LIE, is directly influenced
not only by the declared (F_AFNORISK) and by the true (IV_AFNORISK)
bond amount, but also by age (ETA) and by some financial activities (F_AF2,
F_AF3, F_AF6). In contrast, income (Y) affects the propensity to underreport
only indirectly, i.e., via F_AFNORISK, IV_AFNORISK, F_AF2, F_AF3
and F_AF6.

The estimated NPBN has been validated using statistical tests based on the determinants
of rank correlation matrices and implemented in Uninet. Notice that all
these determinants take values in [0,1] and are equal to 1 if all variables are
independent, and equal to 0 if there is linear dependence between the variables.
More specifically, the validation phase consists of two steps. The first one compares
the determinant (DNR) of the empirical normal rank correlation matrix with that of 
the empirical rank correlation matrix (DER) to validate the joint normal copula. The 
second one compares the determinant (DBN) of the rank correlation matrix of the 
proposed NPBN with the determinant DNR to test if the estimated network is an 
adequate model of the saturated graph. 
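
The behaviour of the determinant of a correlation matrix at the two extremes can be checked with a minimal sketch. The three matrices below are invented for illustration, and the helper `determinant` is a plain Gaussian elimination written for this example; it is not the test implemented in Uninet.

```python
def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]  # work on a copy
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-15:
            return 0.0  # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap changes the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


# Hypothetical correlation matrices for three variables.
independent = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
moderate = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.3], [0.2, 0.3, 1.0]]
degenerate = [[1.0, 1.0, 0.3], [1.0, 1.0, 0.3], [0.3, 0.3, 1.0]]

for name, m in [("independent", independent), ("moderate", moderate), ("degenerate", degenerate)]:
    print(name, determinant(m))
```

The identity matrix gives a determinant of 1, the matrix with two perfectly correlated variables gives 0, and any intermediate dependence structure falls between the two.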
We use the NPBN in Fig. 1 to detect and correct measurement errors in the bond
amounts. To this aim we propose the following three-step procedure:


1. Estimation of the propensity to underreport λ by inserting and propagating the
evidence E given by the observed values of all variables except the true bond
amount IV_AFNORISK.

2. Estimation of the probability distribution of IV_AFNORISK given all
the observed values by inserting and propagating the updated evidence,
i.e. (E, LIE = λ), throughout the network. The individual true value for
IV_AFNORISK can then be predicted by a random draw from such a
distribution and is denoted by I_AFNORISK.

3. Computation of the imputed value. It coincides with the original one if λ <
0.2; otherwise it is given by the linear combination λ I_AFNORISK + (1 - λ)
F_AFNORISK.
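
The last step of the procedure (keep the declared value when the estimated propensity is below 0.2, otherwise combine the random draw and the declared value linearly) can be written as a minimal sketch. The function name and arguments are hypothetical, and the propensity is assumed to lie in [0, 1]; the amounts in the usage lines are invented.

```python
def imputed_value(declared, drawn, propensity, threshold=0.2):
    """Step 3: imputed bond amount for one unit.

    declared   -- amount reported by the respondent (F_AFNORISK)
    drawn      -- random draw from the conditional distribution of the true amount
    propensity -- estimated propensity to underreport (assumed in [0, 1])
    """
    if propensity < threshold:
        return declared  # the reported value is kept unchanged
    return propensity * drawn + (1.0 - propensity) * declared


print(imputed_value(1000.0, 5000.0, 0.1))  # below the threshold: declared value kept
print(imputed_value(1000.0, 5000.0, 0.5))  # halfway between declared value and draw
```

With this rule, the larger the estimated propensity to underreport, the more weight the draw from the network receives.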


In order to evaluate the performance of the above correction procedure, steps 
1-3 have been applied to all units in the validation sample by means of Uninet’s 
programmatic engine directly working from R (using the RDCOMClient library 
to connect to the engine). After setting the conditioning nodes to the observed 
values in the NPBN, Uninet calculates the underlying Gaussian conditional joint 
distribution analytically. The conditional distributions of the output nodes are then 
obtained by transforming the corresponding marginal distributions from Gaussians 
to their original ones. Notice that without this engine the conditionalization could
be performed only by inserting one unit at a time, making the correction of the overall dataset
nearly impossible.
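
The conditioning mechanism described above (move to normal scores, condition in the Gaussian world, transform back to the original margin) can be sketched for the simplest case of two variables. This is only an illustration under strong assumptions: a bivariate Gaussian copula with a given latent correlation `rho`, empirical margins, synthetic data, and hypothetical function names. It is not Uninet's implementation, which handles the whole network at once.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def normal_score(value, sample):
    """Map a value to a normal score through the empirical CDF of `sample`."""
    n = len(sample)
    rank = sum(1 for s in sample if s <= value)
    rank = min(max(rank, 1), n)  # keep the probability strictly inside (0, 1)
    return STD_NORMAL.inv_cdf(rank / (n + 1))


def empirical_quantile(u, sample):
    """Empirical quantile of `sample` at probability u."""
    ordered = sorted(sample)
    index = min(int(u * len(ordered)), len(ordered) - 1)
    return ordered[index]


def draw_conditional(x_observed, x_sample, y_sample, rho, rng):
    """Draw Y given X = x_observed under a bivariate Gaussian copula."""
    z_x = normal_score(x_observed, x_sample)
    # Conditional law of the latent normal: mean rho * z_x, variance 1 - rho^2.
    z_y = rng.gauss(rho * z_x, (1.0 - rho ** 2) ** 0.5)
    return empirical_quantile(STD_NORMAL.cdf(z_y), y_sample)


# Synthetic declared and actual amounts, for illustration only.
declared = [1, 2, 3, 4, 5, 6, 7, 8, 9]
actual = [10, 25, 30, 60, 80, 120, 200, 350, 900]
rng = random.Random(0)
low = [draw_conditional(1, declared, actual, 0.9, rng) for _ in range(500)]
high = [draw_conditional(9, declared, actual, 0.9, rng) for _ in range(500)]
print(sum(low) / len(low), sum(high) / len(high))
```

Because the draw is mapped back through the empirical quantiles, every simulated value belongs to the observed support of the target variable, whatever its marginal shape.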


3.1 Results 


Results arising from our imputation procedure, displayed in Table 2, are very
promising. For comparison purposes, the means of the absolute distances
between the true amount of bonds and the declared and the imputed ones, respectively,
are computed. As shown in Table 2, the proposed approach reduces the
distance from the true values by 8.5%, on average.

From Table 2 it is also evident that the estimated network in Fig. 1 performs
particularly well when only the liars are taken into account; for this group, the
imputation procedure reduces the distance from the true values by 13.6%.


Table 2 Imputation procedure performance

Group    Distance true - observed value (mean)    Distance true - imputed value (mean)    Relative difference (%)
ALL      59594.23                                 54507.86                                -8.5
LIARS    104456.8                                 90267.56                                -13.6
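
The relative differences reported in Table 2 follow directly from the two mean distances in each row. A short check, with a hypothetical helper name:

```python
def relative_difference(observed_distance, imputed_distance):
    """Percentage change of the mean distance after imputation."""
    return 100.0 * (imputed_distance - observed_distance) / observed_distance


print(round(relative_difference(59594.23, 54507.86), 1))   # ALL
print(round(relative_difference(104456.8, 90267.56), 1))   # LIARS
```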
Fig. 2 Kernel density estimation of the observed, imputed, and true distribution of bonds: the
group of liars


Finally, Fig. 2 shows the kernel density estimates of the observed (OBS), the
imputed (IMP), and the true (TRUE) amounts of bonds, respectively, for the liars only.
Compared with the declared values, the imputed values include fewer low amounts,
and their distribution is closer to that of the true values. Analogous results are obtained
when the whole sample is considered.


4 Conclusions 


In this paper nonparametric Bayesian networks have been used to model the 
underreporting generating process affecting the Banca d’Italia Survey on household
income and wealth. Moreover, a new procedure to sample from the NPBN and 
automatically correct the data has been introduced and evaluated on a validation 
sample associated to SHIW. The results are promising; therefore, the proposed 
imputation procedure could be a valid tool to deal with measurement errors when 
continuous variables with non-normal distribution, such as financial assets, are 
considered. A possible limitation of this analysis is given by the Gaussian copula 
assumption. Therefore further research could focus on the possible use of different 
copula families and on the consequent possibility to efficiently sample from the 
associated Bayesian network. 

Another aspect deserving attention is that of the distribution shape of the variable 
to be corrected. As shown in Fig. 2, in our application the distribution is strongly 
asymmetric. As a consequence, the random draw from the conditional distribution
may generate particularly large bond amounts, thus reducing the improvement due
to imputation. In order to avoid this problem or to limit its effects, one solution could
consist in identifying homogeneous groups with respect to their response behavior,
and treating them separately in the imputation phase.
Copula Grow-Shrink Algorithm
for Structural Learning


Flaminia Musella, Paola Vicard, and Vincenzina Vitale 


Abstract The PC algorithm is the best-known constraint-based algorithm for
learning a directed acyclic graph using conditional independence tests. For Gaussian
distributions the tests are based on Pearson correlation coefficients. A PC algorithm for
data drawn from a Gaussian copula model, Rank PC, has been recently introduced 
and is based on the Spearman correlation. Here, we present a modified version of 
the Grow-Shrink algorithm, named Copula Grow-Shrink; it is based on the recovery 
of the Markov blanket and on the Spearman correlation. By simulations it is shown 
that the Copula Grow-Shrink algorithm performs better than the PC and the Rank 
PC algorithms, according to the structural Hamming distance. Finally, the new 
algorithm is applied to Italian energy market data. 
1 Introduction


A Bayesian network (BN, [2]) is a graphical model representing the multivariate 
probability distribution of a set of variables by means of a directed acyclic graph 
(DAG). BNs are applied in many real contexts for their easy-to-read pictorial
representation of complex problems and for their capability to evaluate scenarios.
In fact, BNs can be provided with an inference engine that makes it possible to carry out what-if
analysis by means of computationally efficient algorithms. However, building a BN
can be tricky: most of the time, the dependencies are unknown or only partially known, so
that the DAG cannot be built manually but has to be estimated directly from data.
Many structural learning methods are suitable for discrete data and for normal data,
but only a few for non-Gaussian ones. For recent interesting developments on
the use of DAGs for non-normal data via copula functions, we refer to [1, 3].
Here, a Gaussian copula model is considered, and a structural learning algorithm
for nonparanormal data is proposed.

The paper is organized as follows: nonparanormal graphical models are briefly 
recalled in Sect. 2; known structural learning methods together with the proposed 
one are discussed in Sect. 3; simulation results are presented in Sect. 4.1, while a
real case application is shown in Sect. 4.2. Section 5 addresses the conclusions. 


2 Nonparanormal Graphical Models 


A DAG is a mathematical object made of a finite set of nodes and directed edges 
arranged without producing directed cycles. Nodes of a DAG are associated with 
random variables, either discrete or continuous, and arrows between nodes represent 
direct relevance of one variable to another. A node, say $X_i$, is said to be a parent of another
node, say $X_j$, if there is an arrow from $X_i$ to $X_j$; correspondingly, $X_j$ is said to be a
child of $X_i$. The DAG is thus a map of conditional independence statements that
can be read by means of the d-separation criterion [13]. DAGs entailing the same
set of conditional independence relations are called Markov equivalent and can be
represented by a partially directed acyclic graph (PDAG) or, uniquely, by a completed
partially directed acyclic graph (CPDAG).
Recently, nonparanormal graphical models have been defined in [8] as follows: 


Definition 1 Let $f = (f_v)_{v \in V}$ be a collection of strictly increasing functions
$f_v : \mathbb{R} \to \mathbb{R}$ and $\Sigma \in \mathbb{R}^{V \times V}$ be a positive definite correlation matrix. The
nonparanormal distribution $NPN(f, \Sigma)$ is the distribution of the random vector
$(f_v(Z_v))_{v \in V}$ for $(Z_v)_{v \in V} \sim N(0, \Sigma)$.


Definition 2 The nonparanormal graphical model $NPN(G)$ associated with a
DAG $G$ is the set of all distributions $NPN(f, \Sigma)$ that are Markov with respect
to $G$.


The function $f_v$ realizes a deterministic transformation of $Z_v$ that preserves, in the
nonparanormal model, the dependence structure of the underlying latent multivariate
normal distribution.

If $X \sim NPN(f, \Sigma)$ and $Z \sim N(0, \Sigma)$, then $X_A \perp\!\!\!\perp X_B \mid X_S \iff Z_A \perp\!\!\!\perp Z_B \mid Z_S$
for any triple of pairwise disjoint sets $A, B, S \subset V$. For two nodes $(u, v)$ and a
separating set $S$ we have $X_u \perp\!\!\!\perp X_v \mid X_S \iff \rho_{uv|S} = 0$. Accurate estimators of the latent
normal correlation coefficients are obtained by a trigonometric transformation of the
Spearman rank correlation ($\hat{r}$). Reference [6] shows that if $(X, Y)$ is a bivariate
normal with $\mathrm{Corr}(X, Y) = \rho$, then:


$$P\left(\left|2\sin\left(\frac{\pi}{6}\hat{r}\right) - \rho\right| > \epsilon\right) \le 2\exp\left(-\frac{2}{9\pi^2}\, n \epsilon^2\right) \qquad (1)$$


Since $\hat{r}$ depends on the observations only via their ranks, which are preserved under
strictly increasing functions, (1) still holds for nonparanormal graphical models with
Pearson correlation $\rho = \Sigma_{uv}$ in the underlying latent bivariate normal distribution.
Therefore, $\rho$ is estimated by the Pearson formula [9] as:


$$\hat{\rho} = 2\sin\left(\frac{\pi}{6}\hat{r}\right) \qquad (2)$$


The same transformation still holds for the partial correlation coefficients.
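
The estimator obtained by applying the sine transformation to the Spearman rank correlation can be tried out with a minimal sketch. The data are invented, ties are assumed to be absent (as for continuous variables), and the function names are hypothetical.

```python
import math


def ranks(values):
    """Ranks 1..n of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for tie-free samples."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def latent_correlation(x, y):
    """Estimate of the latent normal correlation: 2 * sin(pi * r / 6)."""
    return 2 * math.sin(math.pi * spearman(x, y) / 6)


x = [0.3, -1.2, 2.1, 0.8, -0.4, 1.5]
y = [0.1, -0.9, 1.7, 1.1, 0.2, 0.6]

# Strictly increasing transformations leave the ranks, and hence the estimate, unchanged.
x_transformed = [math.exp(v) for v in x]
y_transformed = [v ** 3 for v in y]
print(latent_correlation(x, y), latent_correlation(x_transformed, y_transformed))
```

The two printed values coincide, which is the invariance that makes the estimator suitable for nonparanormal data.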


3 Structural Learning 


The BN learning process consists of two phases: building the DAG corresponding to 
the conditional independence statements, and estimating marginal and conditional 
probability distributions. Most of the time the networks have to be learned from data. In
such situations computationally efficient algorithms are needed. Structural learning
methods are mainly developed according to two approaches: score-and-search
techniques, which span the space of models and select those optimizing an information
criterion, and constraint-based algorithms, which use conditional independence
tests. The main constraint-based methods are briefly presented below.


PC Algorithm and Rank PC Algorithm 
The PC algorithm [11] is a backward algorithm consisting of three steps: 


1. identification of the skeleton of the graph (i.e., the underlying undirected graph) 
by recursively testing marginal and conditional independencies; 

2. identification of v-structures (three-node configurations such as $X_i \to X_k \leftarrow X_j$,
standing for conditional dependence between $X_i$ and $X_j$ given $X_k$) on the
basis of the test results of the previous step;

3. orientation of the remaining undirected links without producing additional
v-structures and/or directed cycles.


For multivariate normal observations, it is shown in [4] that the PC algorithm has
high-dimensional consistency properties. In the case of normal data, the PC algorithm tests
conditional independence between two variables, say X and Y, given a separating set
S, by computing the partial correlation $\rho_{XY|S}$. It holds that $X \perp\!\!\!\perp Y \mid S \iff \rho_{XY|S} = 0$.
The sample partial correlation $\hat{\rho}_{XY|S}$ is used as an estimate of $\rho_{XY|S}$.


In many situations the variables are not Gaussian; for this reason, a rank version of the PC
algorithm, named Rank PC (RPC) algorithm, has been proposed by Harris and Drton [8]. The RPC
algorithm tests conditional independence between two variables, say X and Y, given
a separating set S by computing the rank-based partial correlation estimates between
X and Y via (2). The consistency of the RPC algorithm is proved in [8] under some
non-strict assumptions. It is shown that RPC performs as well as the PC algorithm
for normal data, and considerably better for non-normal data under the assumption
that the joint distribution follows a normal copula model. The PC and RPC algorithms
are implemented in the pcalg R package [5].


Grow-Shrink Algorithm and Copula Grow-Shrink Algorithm 

The Grow-Shrink algorithm (GS, [7]) uses the concept of Markov blanket of a 
variable. The Markov blanket of a node X, MB(X), consists of all parents, children, 
and parents of children of X. MB(X) d-separates X from any other variable outside 
MB(X). In other words, MB(X) contains all the variables in the graph carrying 
information about X. The GS algorithm focuses on the recovery of MB(X) using 
pairwise independence tests. It consists of two phases. In the first phase (growing), 
MB(X) is initially an empty set, denoted by S. Then the algorithm adds variables to 
S as long as they are associated with X given the current contents of S. In this phase, 
even variables not really belonging to MB(X) could be added to S. The second 
phase (shrinking) is performed to identify and remove these variables. GS is
implemented in the bnlearn R package [10].
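
The growing and shrinking phases can be sketched as follows. This is an illustration only, not the bnlearn implementation: the statistical test is replaced by an exact check on partial correlations computed from a known correlation matrix, and the chain A, B, C, D with edge correlation 0.5, the tolerance and all function names are assumptions made for the example.

```python
import math


def partial_corr(corr, x, y, given):
    """Partial correlation of x and y given a list of variables (recursive formula)."""
    if not given:
        return corr[x][y]
    z, rest = given[0], given[1:]
    r_xy = partial_corr(corr, x, y, rest)
    r_xz = partial_corr(corr, x, z, rest)
    r_yz = partial_corr(corr, y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def grow_shrink_blanket(target, variables, dependent):
    """Recover the Markov blanket of `target` with a grow phase and a shrink phase."""
    blanket = []
    changed = True
    while changed:  # growing: add variables dependent on the target given the current set
        changed = False
        for v in variables:
            if v != target and v not in blanket and dependent(target, v, blanket):
                blanket.append(v)
                changed = True
    for v in list(blanket):  # shrinking: drop variables that are now redundant
        others = [w for w in blanket if w != v]
        if not dependent(target, v, others):
            blanket.remove(v)
    return set(blanket)


# Toy chain A -> B -> C -> D: the correlation between two nodes is 0.5 ** (path length).
names = ["A", "B", "C", "D"]
corr = {a: {b: 0.5 ** abs(i - j) for j, b in enumerate(names)} for i, a in enumerate(names)}


def dependent(x, y, given, tol=1e-9):
    """Exact 'test': dependence means a non-zero partial correlation."""
    return abs(partial_corr(corr, x, y, list(given))) > tol


blanket_b = grow_shrink_blanket("B", ["D", "A", "C"], dependent)
print(blanket_b)
```

Visiting D first makes the growing phase add it to the candidate set, and the shrinking phase then removes it, leaving the parent A and the child C of B.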


Here the Copula Grow-Shrink (CGS) algorithm is proposed. It has the same
logical structure as GS, but the marginal and partial correlation coefficients used in
the statistical test for independence are computed through (2). The CGS algorithm makes it
possible to learn the structure even when data are non-Gaussian. In this way the unrealistic
normality assumption and preliminary data discretization can be avoided.


4 Experiments 


In this section the CGS performance is discussed both in comparison with other 
algorithms by a simulation (Sect. 4.1) and by an application to real data (Sect. 4.2). 


4.1 Simulation 


Fig. 1 The simulated graph


Table 1 Simulation results by sample size, data type and algorithm


According to the procedure implemented in the pcalg R package [5], and to
ensure faithfulness (i.e., the exact correspondence between conditional independencies
in the distribution and in the DAG), a random DAG, made of ten nodes and
with sparsity parameter s = 0.4, is simulated (see Fig. 1). Data are sampled from
it following: (a) a multivariate normal distribution; (b) a Gaussian copula distribution,
whose latent multivariate normal distribution is that of (a); (c) contaminated data
from a mixture of Gaussian (80%) and Cauchy (20%) distributions not belonging to
the nonparanormal models, as in [8]. We sample 250 datasets for each type
of data described above, fixing n ∈ {50, 500}. On every training set, structural
learning has been performed using PC, RPC, GS, and CGS algorithms with a 
0.01 significance level. Algorithm performances have been compared in terms of 
structural Hamming distance (SHD, [12]), a measure counting the number of actions 
(add, delete, reverse) necessary to transform the estimated graph into the true one. 
In Table 1 the SHD mean and standard deviation relative to all simulations are
reported. For small sample size (n = 50), results show that the mean value is 
always smaller for the CGS algorithm than for PC, RPC, and GS when data are 
not normal; the standard deviation is instead quite unstable due to the presence of 
outliers (see Fig. 2a). Box-plots comparison shows that outliers, present in CGS 
box-plot, correspond to non-anomalous values in the PC and RPC distributions. For 
large sample size (n = 500), with the exception of normal data, the mean value 
is always much smaller for CGS than for PC, RPC, and GS algorithms. The CGS 
distribution is concentrated on SHD values smaller than those for PC, RPC, and 
GS distributions (box-plot in Fig. 2b). CGS box-plot shows many outliers, but they 
correspond to non-anomalous values in the other algorithm distributions (Fig. 2b). 
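
The SHD used in this comparison can be illustrated with a minimal sketch for two directed graphs on the same nodes, where each missing, extra, or reversed arc counts as one action. This is a simplified version written for illustration (the measure in [12] is defined on CPDAGs); the function name and the two small graphs are hypothetical.

```python
def shd(estimated, true):
    """Structural Hamming distance between two sets of (parent, child) arcs."""

    def status(arcs, a, b):
        # Orientation of the link between a and b: "->", "<-" or None if absent.
        if (a, b) in arcs:
            return "->"
        if (b, a) in arcs:
            return "<-"
        return None

    pairs = {frozenset(arc) for arc in estimated | true}
    distance = 0
    for pair in pairs:
        a, b = sorted(pair)
        if status(estimated, a, b) != status(true, a, b):
            distance += 1  # one add, delete, or reverse action
    return distance


true_graph = {("A", "B"), ("B", "C"), ("C", "D")}
estimated_graph = {("A", "B"), ("C", "B"), ("A", "D")}

# B-C is reversed, C-D is missing, A-D is extra.
print(shd(estimated_graph, true_graph))
```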
Fig. 2 Box-plots of SHD by algorithm and data type for n = 50 (a) and n = 500
(b)


4.2 An Application to Real Data: The Italian Energy Market 


Here, the CGS algorithm is applied to real data provided by the most important
multi-national power company in Italy, the Enel group. The energy production
cost is generally directly proportional to energy demand: as demand increases,
the number of plants in operation has to be increased, taking into account plant
efficiency (first the most efficient, less expensive, then the less efficient, more
expensive and sometimes polluting). Variability in the energy demand, due to
seasonal fluctuations, causes uncertainty in identifying the optimal amount of
production. Statistically speaking, the association structure among the variables of
interest is relevant information for energy managers, and BNs can be a
useful tool to estimate these relationships.

Data refer to the Italian energy market, and their monthly mean values
span from January 2014 to May 2017.¹ As shown in Table 2, the variables involved
in the analysis concern the energy price and its demand, the most important energy
commodities (of which Hydro, Wind, Solar, and Geothermal are renewable), and
their costs.² The distributions of the variables are non-Gaussian; hence a structural learning
algorithm for non-normal data, like CGS, is needed.

The CGS algorithm makes it possible to incorporate some prior knowledge in the structure
learning process. Here, the following arc directions are forbidden: from Gas and Coal
production towards all renewable energies and towards the Demand. The resulting
graph, for α = 0.05, is shown in Fig. 3.


1 As requested by Enel experts, we did not treat the time series before modelling them, since the
aim is to capture and model the variability of the energy market as a whole, including the
seasonality and non-stationarity of the variables.


2 Hereafter, node names will coincide with variable names and will be written in italic.
Table 2 Description of variables involved in the analysis

Name           Description
Demand         Average total monthly energy demand (in MWh)
Hydro          Average monthly energy produced by hydroelectric power plants (in MWh)
Wind           Average monthly energy produced by wind power plants (in MWh)
Solar          Average monthly energy produced by solar power plants (in MWh)
Geothermal     Average monthly energy produced by geothermal power stations (in MWh)
Gas            Average monthly energy produced by thermal power plants burning gas (in MWh)
Coal           Average monthly energy produced by thermal power plants burning coal (in MWh)
Others         Average monthly energy produced by other power plants (in MWh)
Gas cost       Average monthly gas price (in Euro per MWh)
Coal cost      Average monthly coal price (in Euro per MWh)
Energy price   Average monthly energy price (in Euro per MWh)





Fig. 3 DAG for the Italian energy market learned by the CGS algorithm


The model seems to reflect well the dependence structure of the Italian energy
market. It is known that, in Italy, the main power sources are natural gas and
hydroelectricity; the national energy plan also includes an increasing power generation
from all other renewable sources. In particular, Italy is among the largest
producers of electricity from solar energy; wind and geothermal power also
contribute to satisfying the national energy demand. All these features are clearly
depicted by the estimated network structure. In fact, the Hydro production influences
all the other renewable sources and the Gas production levels. As expected, the
energy Demand has a direct effect on Coal and Gas production but none on the
renewable energy productions, since their levels depend on weather and seasonality
only. The Energy Price is directly affected by the Coal production and by the costs
of both non-renewable energy commodities (Coal Cost and Gas Cost). All other
variables influence the energy price indirectly; for instance, Demand affects
Energy Price via Coal production.


5 Conclusions 


In this paper the issue of BN structural learning for nonparanormal data has been 
addressed. The Copula Grow-Shrink algorithm is proposed, based on the recovery 
of the Markov blanket of the nodes and on the Spearman correlation. The paper 
provides both a simulation study to compare, in terms of learning performance, the 
proposed CGS algorithm to PC, RPC, and GS algorithms, and an application to the 
Italian energy market data. The application shows that the algorithm is appropriate
for capturing relations in a complex field such as the energy market.

The simulation results are very promising, and further evaluations are planned for
the near future. Among these: comparing the skeleton identification
ability using additional performance measures such as TPR, FPR, and TDR;
analyzing the reasons for the outliers in the CGS distribution to improve its robustness;
and comparing simulation results for different levels of DAG sparsity. As for
all constraint-based algorithms, the sensitivity to the α value deserves particular
attention, since there might be a lack of robustness with respect to different α
specifications. Therefore, further research will be devoted to this aspect.
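
As an illustration of the additional skeleton measures mentioned above, the following minimal sketch compares a learned skeleton with a true one. It assumes the usual definitions TPR = TP/(TP+FN) and FPR = FP/(FP+TN), and reads TDR as the true discovery rate TP/(TP+FP); the function name and the toy graphs are hypothetical.

```python
from itertools import combinations


def skeleton_metrics(nodes, true_edges, learned_edges):
    """Compare two undirected skeletons given as iterables of node pairs."""
    true_set = {frozenset(e) for e in true_edges}
    learned_set = {frozenset(e) for e in learned_edges}
    all_pairs = {frozenset(pair) for pair in combinations(nodes, 2)}

    tp = len(true_set & learned_set)              # edges correctly recovered
    fp = len(learned_set - true_set)              # spurious edges
    fn = len(true_set - learned_set)              # missed edges
    tn = len(all_pairs - true_set - learned_set)  # correctly absent edges

    return {
        "TPR": tp / (tp + fn) if tp + fn else float("nan"),
        "FPR": fp / (fp + tn) if fp + tn else float("nan"),
        "TDR": tp / (tp + fp) if tp + fp else float("nan"),
    }


# Toy example: edge orientation is ignored, so ("B", "A") matches ("A", "B").
nodes = ["A", "B", "C", "D"]
true_edges = [("A", "B"), ("B", "C"), ("C", "D")]
learned_edges = [("B", "A"), ("B", "C"), ("A", "D")]
metrics = skeleton_metrics(nodes, true_edges, learned_edges)
print(metrics)
```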
Context-Specific Independencies
Embedded in Chain Graph Models
of Type I


Federica Nicolussi and Manuela Cazzaro 


Abstract For a set of variables collected in a contingency table, we focus on a
particular kind of relationship, namely context-specific independencies. These
are conditional independencies that hold only for particular values of the conditioning set.
Given the advantages of graphical models, we use them to represent different
relationships among the variables, including context-specific independencies.
In particular, we enrich chain graph models with labelled arcs. Furthermore, we
exploit the well-known relationship between chain graph models and hierarchical
multinomial marginal models, and we introduce new constraints on the parameters in
order to describe the context-specific relationships. Finally, we provide an application
to the study of innovation in Italy by comparing two different periods.


Keywords Context-specific independencies · Categorical variables · Ordinal
variables · Stratified chain graph models


1 Introduction 


A context-specific independence (CSI) is a conditional independence that holds only
for certain value(s) of the conditioning variables. Indeed, it is not rare to observe
phenomena that are independent under particular conditions but strongly connected
under other circumstances. In this case, stating
that the two phenomena are conditionally independent is not true, but
ignoring the “partial” lack of connection could be inaccurate. In this work we
consider a set of categorical variables and we study different kinds of relationships
among them, including CSIs. In the literature, marginal and
conditional independencies have received more attention and have been studied in depth. For instance,
given two variables, say $X_1$ and $X_2$, it is usual to investigate whether they are marginally
independent ($X_1 \perp X_2$) or conditionally independent given a third variable $X_3$
($X_1 \perp X_2 \mid X_3$). The CSI statement establishes that the variables $X_1$ and $X_2$ are
independent given $X_3 = i_3$, while the same statement does not hold when $X_3 \neq i_3$;
see among others Boutilier [2]. Indeed, these CSIs were mainly examined to study
problems concerning latent variables; see, for instance, [13].
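
A small numerical example may help to fix the idea. The sketch below uses hypothetical joint probabilities for three binary variables, chosen so that $X_1$ and $X_2$ are independent when $X_3 = 1$ but associated when $X_3 = 2$. For binary variables, independence within a slice is equivalent to a conditional odds ratio equal to 1.

```python
import math

# Hypothetical joint probabilities p[(i1, i2, i3)] for three binary variables.
p = {
    (1, 1, 1): 0.10, (1, 2, 1): 0.10, (2, 1, 1): 0.15, (2, 2, 1): 0.15,
    (1, 1, 2): 0.20, (1, 2, 2): 0.05, (2, 1, 2): 0.05, (2, 2, 2): 0.20,
}


def conditional_odds_ratio(p, i3):
    """Odds ratio between X1 and X2 within the slice X3 = i3."""
    return (p[(1, 1, i3)] * p[(2, 2, i3)]) / (p[(1, 2, i3)] * p[(2, 1, i3)])


def independent_in_context(p, i3, tol=1e-9):
    """For binary X1 and X2, independence given X3 = i3 means odds ratio 1."""
    return abs(math.log(conditional_odds_ratio(p, i3))) < tol


for i3 in (1, 2):
    print(i3, conditional_odds_ratio(p, i3), independent_in_context(p, i3))
```

In this table the independence holds in the context $X_3 = 1$ only, so the full conditional independence $X_1 \perp X_2 \mid X_3$ fails while the CSI holds.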

Our aim is to incorporate the CSI conditions in graphical models that are suitable
to represent different kinds of relationships. Nyman et al. [11, 12] analyse CSIs in
graphical models based on undirected graphs, or on directed acyclic graphs, using
the classical log-linear parametrization. In both papers they equip these
graphs with labelled arcs in order to take the CSIs into account. In this work we
follow the same approach and enrich chain graphs with labelled arcs in order to
display the CSIs as well. Furthermore, we take advantage of the hierarchical multinomial
marginal (HMM) parametrization [1, 4], a generalization of log-linear models,
to represent the dependence relationships. A further advantage of considering the CSIs
lies in the possibility of reducing the number of HMM parameters.

This paper has the following structure. In Sect. 2.1 we give an overview of HMM
models, also considering the case of ordinal variables. In this regard,
we propose constraints on HMM models that satisfy the CSIs. The
representation through (stratified) chain graph models is discussed in Sect. 2.2. In
Sect. 3 an application to the study of the trend in the degree of innovation of Italian
enterprises is provided. Finally, Sect. 4 is dedicated to the conclusions.


2 Methodology 


Let us consider $q$ categorical (ordinal) variables $V = \{X_1, \ldots, X_q\}$ taking values
in the contingency table $\mathcal{I} = (\mathcal{I}_1 \times \cdots \times \mathcal{I}_q)$, where $\mathcal{I}_j = \{1, \ldots, n_j\}$, with
$j = 1, \ldots, q$, such that $i_j \in \mathcal{I}_j$ is the generic value of the variable $X_j$. Note that
$(i_1, \ldots, i_q)$ identifies a particular cell of the contingency table $\mathcal{I}$, and henceforth
we refer to it with the shortcut $(i_{1 \ldots q})$. In the following subsections we describe a
methodology for defining a system of independencies (marginal, conditional and
context-specific) that reveals the relationships among all the variables involved in
the contingency table.


2.1 Hierarchical Multinomial Marginal Models 
for Context-Specific Independencies 


The HMM model is a generalization of the classical log-linear model which allows
conditional and marginal independencies to be represented in the same model. Instead of
considering only the joint distribution, this model also takes into account marginal
distributions and, on these, defines the log-linear parameters by respecting certain
properties of completeness and hierarchy. These new parameters are contrasts (of
sums) of logarithms of probabilities and henceforth we refer to them as HMM
parameters.

Let us consider three variables, $X_1$, $X_2$ and $X_3$. We are interested in describing, jointly,
that $X_1$ and $X_2$ are independent given $X_3$, $X_1 \perp X_2 \mid X_3$, and that
$X_2$ is marginally independent of $X_3$, $X_2 \perp X_3$. To this aim, we
consider the marginal distribution of $\{X_2, X_3\}$ and the joint distribution. We refer to
these by defining the class of marginal distributions $\{\{2, 3\}; \{1, 2, 3\}\}$, where $\{2, 3\}$
and $\{1, 2, 3\}$ are shortcuts for $\{X_2, X_3\}$ and $\{X_1, X_2, X_3\}$. Then we define the
classical log-linear parameters on the marginal contingency table $\mathcal{I}_{23}$ concerning
the variables $\{2, 3\}$ and the remaining parameters on the contingency table $\mathcal{I}$. Let
us denote the HMM parameters by $\eta^{\mathcal{M}}_{\mathcal{L}}(i_{\mathcal{L}})$, where $\mathcal{M}$ refers to the
marginal distribution, $\mathcal{L}$ denotes the subset of variables to which the parameter
pertains and $i_{\mathcal{L}}$, in parentheses, represents the values of the variables selected in
$\mathcal{L}$ (when the parentheses are omitted, the parameters refer to each
$i_{\mathcal{L}} \in \mathcal{I}_{\mathcal{L}}$). Finally, in order to test the marginal and conditional independencies,
we have to constrain to zero the parameters $\eta^{\{2,3\}}_{23}$, $\eta^{\{1,2,3\}}_{12}$ and $\eta^{\{1,2,3\}}_{123}$.

Let us now consider a CSI statement, where the conditional independence
holds only for a subset of the values of the conditioning variable. For instance,

$$X_1 \perp X_2 \mid X_3 = i_3, \qquad i_3 \in \mathcal{K}, \tag{1}$$

where $\mathcal{K} \subset \mathcal{I}_3$ is the subset of the values $i_3$ of $X_3$ for which the conditional
independence holds.

One main goal of this work is the definition of the constraints on HMM
parameters needed to satisfy the CSI in formula (1). In [11], Nyman et al.
deal with the log-linear parameters defined on the joint distribution. Here, as a first
improvement, we also take into account the HMM parameters defined on marginal
distributions; see [10]. Thus, as before, we proceed to define the class of marginal
distributions (the same mentioned above) and to specify the parameters evaluated
on suitable marginal distributions. The constraints satisfying the CSI in formula (1)
are

$$\eta^{\{1,2,3\}}_{12}(i_{12}) + \eta^{\{1,2,3\}}_{123}(i_{12}, i_3) = 0, \qquad i_{12} \in \mathcal{I}_{12}, \; i_3 \in \mathcal{K}, \tag{2}$$


where $\mathcal{I}_{12}$ is the marginal contingency table concerning the variables $\{1, 2\}$.
Another important aspect of this work is the possible presence of ordinal
variables. The classical log-linear models, in fact, are of limited use when we want to
focus on the interpretation of the effects among the variables, in particular
when ordinal variables are involved; see, for instance, [3]. For this reason we
choose different criteria for coding the variables through the parameters. In fact,
beyond the classical baseline criterion, we take advantage of the local criterion, which
is more suitable for ordinal variables. By adopting the local criterion for coding the
conditioning variable, as shown in [10], the constraints in formula (2) become

$$\eta^{\{1,2,3\}}_{12}(i_{12}) + \sum_{i'_3 = 1}^{i_3} \eta^{\{1,2,3\}}_{123}(i_{12}, i'_3) = 0, \qquad i_{12} \in \mathcal{I}_{12}, \; i_3 \in \mathcal{K}. \tag{3}$$


It is worth noting that, when we deal with local parameters, if the CSI is
stated in the alternative form $X_1 \perp X_2 \mid X_3 > i_3$, $i_3 \in \mathcal{K}$, the
constraints in formula (3) are equal to the ones in formula (2). More details are
given in [10].
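
To make the constraints in formulas (2) and (3) concrete, the following sketch considers binary $X_1$ and $X_2$ and a three-level $X_3$. It rests on an assumption that should be checked against [10]: the interaction parameters are read as contrasts of conditional log odds ratios, so that under baseline coding the three-way term compares level $i_3$ with the first level, while under local coding it compares adjacent levels. The probabilities and function names are hypothetical, and the code does not reproduce the parametrization routines of the hmmm package.

```python
import math

# Hypothetical cell probabilities p[i3][(i1, i2)]: X1 and X2 are binary and
# X3 has three ordered levels. The odds ratio equals 1 for X3 = 2 and X3 = 3.
p = {
    1: {(1, 1): 0.12, (1, 2): 0.04, (2, 1): 0.04, (2, 2): 0.12},
    2: {(1, 1): 0.08, (1, 2): 0.08, (2, 1): 0.08, (2, 2): 0.08},
    3: {(1, 1): 0.06, (1, 2): 0.12, (2, 1): 0.06, (2, 2): 0.12},
}
levels = sorted(p)


def log_or(i3):
    """Conditional log odds ratio of X1 and X2 in the slice X3 = i3."""
    t = p[i3]
    return math.log(t[(1, 1)] * t[(2, 2)] / (t[(1, 2)] * t[(2, 1)]))


# Two-way term: association of X1 and X2 at the reference level X3 = 1.
eta_12 = log_or(1)

# Baseline coding: each level of X3 is contrasted with the first level.
eta_123_baseline = {k: log_or(k) - log_or(1) for k in levels}

# Local coding: each level of X3 is contrasted with the preceding level.
eta_123_local = {levels[0]: 0.0}
for k in levels[1:]:
    eta_123_local[k] = log_or(k) - log_or(k - 1)


def lhs_formula_2(i3):
    """Left-hand side of formula (2) under baseline coding."""
    return eta_12 + eta_123_baseline[i3]


def lhs_formula_3(i3):
    """Left-hand side of formula (3) under local coding (cumulative sum)."""
    return eta_12 + sum(eta_123_local[k] for k in levels if k <= i3)


for i3 in levels:
    print(i3, round(lhs_formula_2(i3), 6), round(lhs_formula_3(i3), 6))
```

Under these assumptions both left-hand sides reduce to the conditional log odds ratio at level $i_3$, so they vanish exactly for the levels in which the CSI holds (here $\mathcal{K} = \{2, 3\}$).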


2.2 Stratified Chain Graph Models 


A chain graph (CG) is a graph with both directed and undirected arcs and without
any directed or semi-directed cycle. The vertices of a CG can be grouped into the so-called
chain components, denoted by $T_1, \ldots, T_s$, which are the connected undirected
components. Chain graph models (CGMs) are graphical models built on
chain graphs; see [5]. These models are well suited to representing the structure of
relationships among variables that follow an inherent order.
In particular, we can distinguish variables linked by symmetric relationships from
variables linked by unilateral dependence. In this case we follow this order in
collecting them in chain components.

In the literature, the representation of independencies through CGs is not unique;
a thorough discussion is given in [5]. In this work, we adopt the point of view
of Lauritzen and Wermuth [7], also known as chain graph models of type I,
which are a subclass of the HMM models; see [9] and [14]. These CGMs are the
natural extension of the graphical models based on undirected graphs and directed
acyclic graphs. They interpret the lack of (un)directed arcs conditionally with respect
to the remaining variables in the same component. In addition, all the systems
of independencies representable through these graphical models benefit from the
existence of a smooth likelihood function. In order to take the CSIs into account, we
propose stratified chain graph models (SCGMs) as an extension of the stratified graphical
models (SGMs) introduced by Nyman et al. [11]. Similarly to SGMs, we denote the
CSIs through labelled arcs. Figure 1 depicts an example of an SCGM. In this case
the lack of directed arcs between the nodes $X_1$ and $X_5$, $X_2$ and $X_5$, and finally
between $X_2$ and $X_3$ represents the conditional independencies $X_1X_2 \perp X_5 \mid X_3X_4$
and $X_2 \perp X_3 \mid X_1X_4X_5$. Then the labelled arc between the nodes $X_3$ and $X_4$
represents the CSI $X_3 \perp X_4 \mid X_1X_2X_5 = (i_1, *, i_5)$, where the asterisk
refers to all the values of the variable $X_2$.
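
A labelled arc can be thought of as an arc carrying one or more contexts in which the independence holds. The following sketch shows one possible encoding, with the asterisk acting as a wildcard as in the label above. The data structure, the function `csi_holds`, and the numeric values standing in for $i_1$ and $i_5$ are all hypothetical.

```python
# Hypothetical representation of a labelled arc: each label lists, for every
# conditioning variable, either a specific value or "*" (any value).
# The values 1 and 2 stand in for i1 and i5 and are chosen for illustration.
labelled_arcs = {
    ("X3", "X4"): [{"X1": 1, "X2": "*", "X5": 2}],
}


def csi_holds(arc, context, arcs=labelled_arcs):
    """Return True if some label attached to `arc` matches the context."""
    for label in arcs.get(arc, []):
        if all(v == "*" or context.get(var) == v for var, v in label.items()):
            return True
    return False


print(csi_holds(("X3", "X4"), {"X1": 1, "X2": 3, "X5": 2}))  # True
print(csi_holds(("X3", "X4"), {"X1": 2, "X2": 3, "X5": 2}))  # False
```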
Fig. 1 SCGM with the labelled arc $X_3 - X_4$ referring to value $i_1$ of $X_1$,
value $i_5$ of $X_5$ and all values of variable $X_2$ (arc label: $X_1, X_2, X_5 = i_1, *, i_5$)


3 Application 


In the next subsection we apply the presented model to a
real dataset. First, we select the variables and define the marginal distributions
to take into account, according to the focus of the analysis. In order to find the best
fitting model, we proceed with a three-step algorithm where each model is tested
using the likelihood ratio test $G^2$. The algorithm is explained below.


Step 1 We test the CGMs associated with all possible CGs obtained by deleting only
one arc at a time from the complete graph. Among these models, we select the ones
with a p-value of the likelihood ratio test greater than 0.01.


Step 2 Similarly to Step 1, we test the SCGMs associated with all possible SCGs
obtained by replacing only one arc at a time with a labelled arc, with all possible
labels considered one at a time. Among these models, we select the ones with a
p-value of the likelihood ratio test greater than 0.1.


Step 3 From all admissible models selected in the previous two steps, we test all
possible combinations of marginal and conditional independencies and CSIs, and we
retain the one with the lowest AIC (Akaike information criterion) among the models
with a p-value higher than 0.05.
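
The selection rule of Step 3 can be sketched as follows, together with the usual form of the likelihood ratio statistic, $G^2 = 2 \sum n \log(n / \hat{m})$, where $n$ are observed and $\hat{m}$ fitted cell counts. This is only an illustration: the candidate models, their p-values and their AIC values are hypothetical, and in practice they would come from the fitted models rather than being typed in by hand.

```python
import math


def g2_statistic(observed, expected):
    """Likelihood ratio statistic G^2 = 2 * sum(n * log(n / m)) over cells."""
    return 2.0 * sum(
        n * math.log(n / m) for n, m in zip(observed, expected) if n > 0
    )


def select_best(candidates, alpha=0.05):
    """Step 3: among models with p-value above alpha, keep the lowest AIC."""
    admissible = [c for c in candidates if c["p_value"] > alpha]
    return min(admissible, key=lambda c: c["aic"]) if admissible else None


# Hypothetical candidate models (names and values are for illustration only).
candidates = [
    {"name": "M1", "p_value": 0.17, "aic": -225.98},
    {"name": "M2", "p_value": 0.03, "aic": -240.10},  # rejected: p <= 0.05
    {"name": "M3", "p_value": 0.09, "aic": -210.45},
]
best = select_best(candidates)
print(best["name"])  # M1

# G^2 for a toy table with four cells and a uniform fitted model.
print(round(g2_statistic([30, 20, 25, 25], [25, 25, 25, 25]), 4))
```

Note that M2 has the lowest AIC but is excluded because its fit is rejected at the 0.05 level, which is why the p-value filter is applied before the AIC comparison.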


3.1 The Italian Innovation Survey 


We analyse two datasets from the Italian Innovation Survey, each covering
a 3-year period: the first 2008-2010 and the second 2010-2012 [6]. The two datasets
involve 16,531 and 18,697 small and medium-sized Italian firms, respectively. We
evaluate the revenue growth between the considered years, $X_1$ (1 = No, 2 = Yes).
Then, we consider different factors that contribute to the innovation status of an
enterprise: innovation in products or services or production line or investment in
R&D, $X_2$ (1 = No, 2 = Yes); innovation in organization system, $X_3$ (1 = No,
2 = Yes); and innovation in marketing strategies, $X_4$ (1 = No, 2 = Yes). Another
group of variables concerns the firm’s features: the main market (in
revenue terms), $X_5$ (A = Regional, B = National, C = International); the
percentage of graduate employees, $X_6$ (1 = 0%-10%, 2 = 10%-50%,
3 = 50%-100%); and the enterprise size, $X_7$ (1 = Small, 2 = Medium). We
consider three marginal distributions. First, we define the marginal distribution
$\{5, 6, 7\}$ in order to study the symmetric relationships among the firm features;
second, the distribution $\{2, 3, 4, 5, 6, 7\}$ highlights possible influences of the firm
features on the innovation variables; finally, we consider the joint distribution
$\{1, 2, 3, 4, 5, 6, 7\}$ in order to point out the effect of all variables on the revenue
growth.

Following the three-step algorithm proposed in Sect. 3, in Figs. 2 and 3 we report
the best fitting SCGMs for the periods 2008-2010 and 2010-2012, respectively. Note
that the CSIs are represented by red arcs.


Fig. 2 Best fitting SCGM for the period from 2008 to 2010.
$i_{2357} = (2, 2, 3, *)$ and $i_{2467} = (1, 2, 3, 2)$


Fig. 3 Best fitting SCGM for the period from 2010 to 2012.
$i_{2356} = (2, 2, *, 3)$







Table 1 Values of the test statistics of the selected HMMMs corresponding to the SCGMs in
Fig. 2 (period 2008-2010) and Fig. 3 (period 2010-2012), with the list of independencies that they
represent

Period     Independencies                                  G^2     df   p-value  AIC
2008-2010  $X_1 \perp X_2 \mid X_3X_4X_5X_6X_7$            126.02  112  0.17     -225.98
           $X_4 \perp X_7 \mid X_2X_3X_5X_6$
           $X_4 \perp X_6 \mid X_2X_3X_5X_7 = i_{2357}$
           $X_3 \perp X_5 \mid X_2X_4X_6X_7 = i_{2467}$
2010-2012  $X_1 \perp X_4 \mid X_2X_3X_5X_6X_7$            145.93  123  0.08     -184.06
           $X_3 \perp X_5 \mid X_2X_4X_6X_7$
           $X_4 \perp X_7 \mid X_2X_3X_5X_6 = i_{2356}$

where $i_{2357} = (2, 2, 3, *)$, $i_{2467} = (1, 2, 3, 2)$ and $i_{2356} = (2, 2, *, 3)$








The list of independencies underlying the two SCGMs, together with the
likelihood ratio test $G^2$, the corresponding p-value and the AIC value, is reported in
Table 1.

Note that, in the two figures, while the undirected arcs have the same structure, the
set of directed arcs changes slightly. In particular, in terms of
innovation, the variable $X_3$ affects the growth $X_1$ in both models, while the roles
of the other two innovation variables, $X_2$ and $X_4$, are interchanged. Furthermore, the
dependence relationships between the variables $X_7$ and $X_4$ and between $X_5$ and
$X_3$ turn out to be weak or null. Indeed, these independencies are present in both models,
either as conditional independencies or as CSIs. In the first period we also find the
additional CSI between $X_6$ and $X_4$.

Focusing on the CSIs, we recognize in the first model that the percentage of
graduate employees ($X_6$) does not affect the innovation in marketing strategies ($X_4$)
when there is an innovation in products or services ($X_2 = 2$) and in the organization
system ($X_3 = 2$) and when, whatever the size of the company ($X_7 = *$), the firm
works mainly in an international market ($X_5 = 3$). Again, we can recognize that the
type of the main market where the firm operates ($X_5$) does not affect the innovation
in the organization system ($X_3$) when there is no innovation in products and services
($X_2 = 1$) but there is innovation in marketing strategies ($X_4 = 2$), the percentage of
graduate employees is high ($X_6 = 3$) and the enterprise size is medium ($X_7 = 2$).
On the other hand, in Fig. 3 we can see that the size of the firm ($X_7$) does not affect
the innovation in marketing strategies ($X_4$) when there is innovation in both products
and services ($X_2 = 2$) and organization system ($X_3 = 2$), for any kind of market
($X_5 = *$) and when the employees are highly qualified (high percentage of graduate
employees, $X_6 = 3$).

All the analyses are carried out with the statistical software R and the package
hmmm [4].


4 Conclusions


The representation of relationships among categorical variables is enriched by
considering the CSIs. Graphical models have shown useful properties in the representation
of complex dependence structures and, also in this case, they prove suitable.
On the other hand, the study of CSIs allows us to identify the values
of the variables that really discriminate between dependence and independence
structures, neglecting the unnecessary parameters. For this reason it is possible
to develop strategies concerning the values of the conditioning variable where the
independence does not hold. Further developments of these models may concern the
associated multivariate regression models, similarly to the approach of [8].
Part V
Big Data Analysis


Big Data and Network Analysis: A
Combined Approach to Model Online
News


Giovanni Giuffrida, Simona Gozzo, Francesco Mazzeo Rinaldi, 
and Venera Tomaselli 


Abstract In recent years, large volumes of data have been generated by automatic
extraction of information, innovative data mining, and predictive analytics. This
paper proposes an innovative approach that combines Big Data with the analysis
of relational structures in order to improve actionable analytics-driven decision patterns.
From the website of one of the largest online Italian newspapers, interactions
among users and their comments about a 2016 Italian constitutional review bill
are organized in a Big Data audience model. Readers’ sentiments are measured
and relational patterns are classified by descriptive measurements and clustering
structures implemented in Network Analysis methods.


Keywords Big Data · Network Analysis · Relational patterns · Clustering
structures


1 Big Data and Network Analysis 


Today, in the Big Data (BD) world, the extraction of meaningful knowledge from the 
available data represents a severe challenge for data analysts. Data is “big” because 
of its high dimensionality, complexity, and heterogeneity: for example, textual or 
graphical data extracted from social networks are often mixed with socioeconomic 
data for more tailored marketing activities.

Unfortunately, traditional analytical tools quite often break down when dealing
with BD. As a matter of fact, due to the need for simultaneous processing, many
statistical analytical techniques do not scale to BD. Suitable statistical techniques
for dimensionality reduction are now needed for both data visualization and
numerical treatment. Specifically, sentiment analysis and classification
methods are very useful for analysing BD.

Social media are a precious source of data for studying public opinion and the definition
of new trends, offering insights into attitudes, behaviours, discourses, and social
linkages among individuals. Social media data could be integrated with common 
data sources like online searches, text mining, mobile phone devices, sensors, 
satellite images, or Global Positioning Systems (GPS) for a more complete data 
repository to analyse. 

The growing importance of social media data in the social sciences
arises from the availability and affordability of swiftly collected, real-time, large-scale
data. With these data, we can follow individuals and their networks over time
and across spaces, gaining a new source of information, particularly when data
quality is poor or data are unavailable altogether.

Digital social interactions generate data on a large scale and at low cost. Reported
events or surveyed opinions are sources of data about the context, content, and meaning
of social interactions [2]. Network Analysis (NA) methodology provides the theoretical
foundation for analysing networks depicting social interactions among
actors [5]. Specifically, it is also employed in the relational and structural analysis
of decision-making [6] and organizational behaviour patterns [3]. NA methods
measure structural attributes of networks (size, density, clustering, openness,
stability, reachability, and centrality [11]) and classify structural patterns.
In the literature, we found one study, a large-scale analysis of the news media
coverage of the 2012 US presidential election campaign [10], that proposes a BD
approach combined with NA methods. Key actors and their relations are extracted
from the online media narrative of the US elections, and this information is
organized as a directed semantic graph.

Taking into account the interest in understanding and improving actionable
analytics-driven decision patterns, here we present a study based on a very
large database generated by one of the largest online Italian newspapers. All the
readers’ comments on news and articles about the Italian Senate
constitutional reform are employed as sources of information. Focusing on the audience
analysis system to detect online news, the findings of the study show how NA
methods and BD tools can be efficiently integrated in the analysis of relational
patterns to improve knowledge of decision-making processes.

The approach is innovative in that it gains insights from the linguistic analysis
of texts by extracting relational data. Combining BD mining techniques with NA
measurements, the study aims to collect and analyse information about online
readers to detect proxies of structural and positional network indicators. With this
in mind, the network measurements are useful to define reliable impact indicators
suitable for detecting opinions and moods in real time in terms of relevance, visibility,
reachability, marginality, and resilience of BD.

We first developed an audience BD model to collect and manage information from
users’ interactions with published news and comments about a recent Italian
constitutional review bill with important political implications. This is achieved
by using the advanced search feature of one of the most widely read Italian online
newspapers to extract text entities, in order to profile users’ targets linked to the
interest and mood expressed through reads and comments.

Afterwards, NA methods are employed to analyse the BD model, measuring the
relational structure of readers browsing the newspaper website and focusing on the
importance of selected news with a great impact in terms of their capability to connect
readers and commenters as web users in a net.


2 The Big Data Audience Model 


Our BD audience model, already applied to other online newspapers [7], is a
standard relational model composed of tables and relationships among them [4]. In a
typical relational model, “tables” (i.e., “relations”) are collections of “rows”, where
each row contains several fields arranged in columns. Two tables may be “joined”
based on some “key” values. A “join” is a relationship between two tables and it can
be of different types: “one to many”, “many to many”, or “one to one”. A “one to
many” relationship occurs when one row in one table corresponds to many rows
in the second table. A “many to many” relationship occurs when many rows from
one table correspond to many rows in the second table. Specifically, in our model, a
user performs many actions (i.e., reads many articles in the newspaper), giving rise
to a “one to many” relationship between the “user” table and the “action” table.
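
The “one to many” relationship between users and actions can be sketched with an in-memory SQLite database. The table and column names below are hypothetical and only illustrate the kind of schema described; they are not the schema of the actual audience model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, nickname TEXT);
    CREATE TABLE contents (content_id INTEGER PRIMARY KEY, title TEXT);
    -- One row in users corresponds to many rows in actions (one to many).
    CREATE TABLE actions (
        action_id  INTEGER PRIMARY KEY,
        user_id    INTEGER REFERENCES users(user_id),
        content_id INTEGER REFERENCES contents(content_id)
    );
    INSERT INTO users VALUES (1, 'reader_a'), (2, 'reader_b');
    INSERT INTO contents VALUES (10, 'Article 1'), (11, 'Article 2');
    INSERT INTO actions (user_id, content_id) VALUES (1, 10), (1, 11), (2, 10);
""")

# Join the two tables on the key and count the pageviews of each user.
rows = conn.execute("""
    SELECT u.nickname, COUNT(a.action_id) AS pageviews
    FROM users AS u JOIN actions AS a ON a.user_id = u.user_id
    GROUP BY u.user_id
    ORDER BY u.user_id
""").fetchall()
print(rows)  # [('reader_a', 2), ('reader_b', 1)]
```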

In our BD audience model, each user browses the website and performs actions
towards specific contents. A “content” is a general HTML document published on
the newspaper website. An “action” is a pageview generated by a user on a given
content. Using the advanced search feature available on the newspaper site, we
retrieved all articles containing the “riforma Senato” string in the time period from
January 1st, 2014 through December 31st, 2014. In this way, we collected 47 articles in total.

After a manual check, we discarded 29 of those as they were not actually focused
on the specific topic but on other political issues. The result was 18 articles from the
“Politics” section of the newspaper about the Senate reform bill, published in the
time frame from March 12th, 2014 until August 8th, 2014. In the same time frame,
a total of 1788 articles were published in the “Politics” section of that newspaper.
After an article is published, readers can post
comments on it; this basically starts a blog around that specific article.
We assume that the larger the number of comments on a specific article, the more
that article raised readers’ interest. Thus we use the number of comments on each
article as a proxy for the readers’ interest in that article. The 18 specific articles we
selected generated a total of 886,898 pageviews and 2461 comments, whereas the
1788 general politics articles generated a total of 32,774,270 pageviews and 14,108
comments. In order to compare the readers’ interest in the specific Senate reform
articles, we computed some quantitative measures, specifically:


• the average number of pageviews per article. This measures readers’ interest in
the topic: 49,272 for the 18 articles versus 18,330 for the 1788 general articles.

• the average number of comments per article. This measures readers’ engagement
with the topic: 137 for the specific articles versus 8 for the general articles.

• the probability of generating new comments. This is the probability that a reader
writes a new comment after reading the article: 0.28% versus 0.04% for specific
and general articles, respectively.
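
The three measures above follow directly from the reported totals, as the short check below shows. The only assumption is that the probability of generating a new comment is computed as the ratio of comments to pageviews, which reproduces the reported percentages.

```python
# Totals reported in the text for the two groups of articles.
groups = {
    "Senate reform": {"articles": 18, "pageviews": 886_898, "comments": 2_461},
    "General politics": {"articles": 1_788, "pageviews": 32_774_270, "comments": 14_108},
}


def summary(g):
    """Per-article averages and the share of pageviews followed by a comment."""
    return {
        "pageviews_per_article": g["pageviews"] / g["articles"],
        "comments_per_article": g["comments"] / g["articles"],
        "comment_probability_pct": 100 * g["comments"] / g["pageviews"],
    }


for name, g in groups.items():
    s = summary(g)
    print(
        name,
        round(s["pageviews_per_article"]),
        round(s["comments_per_article"]),
        round(s["comment_probability_pct"], 2),
    )
```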


These numbers clearly show a significant interest of the readers in the Senate
reform. Furthermore, in order to check the relevance of the Senate reform, we
compare NA results obtained from two data samples: the first includes the 18 articles
about the Senate reform, while the second includes 18 random articles selected from
the political section of the newspaper.

Our analysis focuses on the measurement of the readers’ mood (from negative
to positive) for each comment. Thus, we label each comment through a sentiment
analysis algorithm with a score ranging from -1 to +1. Both the sentiment analysis
and the extraction of semantic entities contained in the 18 articles were performed
using a public Application Programming Interface (API) service for Semantic
Text Analytics, provided by https://dandelion.eu/. We found that the entities
that received most of the comments were: “Senato”, “Governo”, “Parlamento”,
“Senatore”, and “Matteo Renzi”, with an average sentiment score of -0.426. Since
the comments in a newspaper generally tend to be negative (see, among others, [9]),
this negative sentiment score does not immediately imply that commenters were
against the proposed bill. Comparing the average sentiment score of the top entities
with the average sentiment score of the comments in the general Politics section
(-0.38), the difference is slight (about 0.04), representing only
8% of the standard deviation measured on the comments (0.5).


3 The Network Analysis Application 


We employed NA methods to describe relational BD by drawing a network graph of
points (nodes) and lines (edges) connecting them, using the Gephi and NodeXL tools. In
order to detect the main relational clusters, according to the NA criteria [11], the
data structure is modified assuming that the nodes are the reference objects and the
links among nodes are the units of analysis. For this purpose the analysis does not
focus on readers’ attributes but on the importance of news, in terms of their capability to
connect people, according to the following rules:


• two nodes/users are connected if they read the same articles, even at different
times;

• in the two graphs (Fig. 1), links between readers are represented after removing
repeated and recursive ties.
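
The two rules can be illustrated on a toy pageview log. The sketch assumes that one shared article is enough to connect two readers and reads “recursive ties” as self-loops; both are interpretations of the rules above, and the user and article identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical pageview log: (user, article) pairs, possibly repeated.
pageviews = [
    ("u1", "a1"), ("u2", "a1"), ("u3", "a1"),
    ("u1", "a2"), ("u2", "a2"), ("u2", "a2"),
    ("u4", "a3"),
]

readers = defaultdict(set)
for user, article in pageviews:
    readers[article].add(user)  # a set drops repeated views by the same user

# An undirected edge joins two distinct users who read a common article.
# Storing edges in a set removes repeated ties; pairing distinct users
# rules out self-loops.
edges = set()
for users in readers.values():
    for u, v in combinations(sorted(users), 2):
        edges.add((u, v))

nodes = {user for user, _ in pageviews}
print(sorted(edges))           # [('u1', 'u2'), ('u1', 'u3'), ('u2', 'u3')]
print(len(nodes), len(edges))  # 4 3
```

Readers u1 and u2 share two articles but are joined by a single edge, and u4, who shares no article with anyone, remains an isolated node.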


According to these rules, data from the two samples described in Sect. 2 are
used to reconstruct each relational structure. Two graphs are obtained: the
first includes 1892 nodes and 386,952 edges among all registered readers
viewing the 18 political articles (selected randomly); the second includes
15,550 nodes and 515,739 edges among nodes of the specific articles from the
“Politics” section of the newspaper about the Senate reform bill.

To analyse the structural behaviour of the graphs, the modularity measure and 
the clustering coefficient are employed. Modularity quantifies the quality of 
a division of the network into “modules” or communities: a good subdivision has 
high modularity, meaning that the density of ties is high within communities 
and low between them. The modularity coefficient thus also reflects the quality 
of communication in the network. The modularity index can vary from 0 to 1, and 
a greater coefficient indicates the tendency of the whole network to be 
structured in cohesive groups with high internal connections but few 
connections to each other. It is a good tool for our analysis goals [1, 8]. 
Specifically, this index identifies which and how many “communities” form 
around specific aggregates of information, or whether the interest in the issue 
(in our case, the Senate reform) is instead “generic” or “generalized”. The 
latter happens when there are no distinct “communities” and the network is 
rather unitary. 
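
As a rough illustration of how such an index behaves, the sketch below computes the standard Newman modularity of a given partition on a toy graph (two triangles joined by one tie). The function name and the toy graph are our own assumptions and are unrelated to the newspaper data.

```python
def modularity(edges, communities):
    """Newman modularity Q = sum_c [ L_c / m - (d_c / (2 m))**2 ] for an
    undirected simple graph, where m is the number of edges, L_c the number
    of edges inside community c and d_c the total degree of its nodes."""
    m = len(edges)
    label = {v: k for k, members in enumerate(communities) for v in members}
    inside = [0] * len(communities)
    degree = [0] * len(communities)
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            inside[label[u]] += 1
    return sum(l / m - (d / (2 * m)) ** 2 for l, d in zip(inside, degree))

# Toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the tie 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
q_split = modularity(edges, [{0, 1, 2}, {3, 4, 5}])    # two communities
q_single = modularity(edges, [{0, 1, 2, 3, 4, 5}])     # one unitary network
```

Splitting the toy graph into its two triangles gives a clearly positive value, while treating the whole network as a single community gives zero, which mirrors the “unitary” case described above.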

The graph modularity is 0.37 for links among political news and 0.18 for news 
about the Senate reform; modularity thus shows that the graph on the Senate 
reform (Fig. 1) is more “compact” than the graph about general political issues. 
The latter shows separated subgroups and a weak community structure. However, 
both graphs have the same clustering coefficient (0.84). This index measures 
the extent to which the nodes of a graph tend to be connected, through the 
proportion of closed triples within the network. In particular, the clustering 
coefficient measures how much the nodes to which a focal node is linked are in 
turn connected to each other. 
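
Both readings of the clustering coefficient given above (the proportion of closed triples in the whole network, and the connectedness of a focal node's neighbours) can be sketched as follows. The helper names and the toy graph are assumptions made for illustration only.

```python
from itertools import combinations

def neighbours(edges):
    """Adjacency sets of an undirected simple graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def local_clustering(adj, v):
    """Share of pairs of v's neighbours that are themselves linked."""
    k = len(adj[v])
    if k < 2:
        return 0.0
    closed = sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
    return closed / (k * (k - 1) / 2)

def transitivity(adj):
    """Closed triples divided by all connected triples in the graph."""
    closed = total = 0
    for v in adj:
        for a, b in combinations(adj[v], 2):
            total += 1
            closed += b in adj[a]
    return closed / total if total else 0.0

# Toy graph: two triangles joined by the tie 2-3.
adj = neighbours([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)])
global_cc = transitivity(adj)
focal_cc = local_clustering(adj, 2)
```

Node 2 bridges the two triangles, so only one of its three neighbour pairs is linked, whereas the other members of each triangle have fully connected neighbourhoods.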

In this study, this coefficient can be particularly useful for evaluating the 
importance attributed to some news or information, evidently “interesting” 
enough to produce a convergence of interests in small groups. In fact, the 
measure takes into account the connections focused on specific issues. Since 
both graphs in Fig. 1 have the same clustering index but different values of 
the modularity coefficient, this suggests that the focus could be, in this 
case, on the identification of problems and themes that probably include 
different dimensions and arguments, rather than on single news items or pieces 
of information which polarize the attention of specific, small groups. 

The Clauset-Newman-Moore algorithm [1] makes it possible to identify cohesive 
clusters within the graph, grouping the vertices by means of modularity. It is a 
method for selecting the more cohesive groups within the net, based on 
modularity as a property of the network that specifies its division into 
“communities”. A community structure groups vertices when there is a higher 
density of edges within groups than between them. The problem of detecting such 
communities within networks has been studied recently. We choose the 
Clauset-Newman-Moore algorithm because it uses sophisticated data structures 
and works well on extremely large networks. The first graph (with links about 
political articles) shows intersections among seven collapsed groups, while the 
second (Senate reform) shows four clusters (Fig. 1). 
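
The merge rule behind this kind of algorithm can be illustrated with a deliberately naive sketch: start from singleton communities and repeatedly merge the pair of connected communities with the largest positive gain in modularity. This is not the implementation used for the analysis; it omits the efficient data structures that make the Clauset-Newman-Moore algorithm suitable for very large networks, and the function name and toy graph are our own assumptions.

```python
def greedy_modularity_communities(edges):
    """Naive agglomerative modularity maximisation on a simple undirected graph."""
    m = len(edges)
    comm = {}   # community id -> set of member nodes
    a = {}      # community id -> fraction of edge ends attached to it
    e = {}      # frozenset({c, d}) -> 1/(2m) per edge joining c and d
    for u, v in edges:
        for w in (u, v):
            comm.setdefault(w, {w})
            a[w] = a.get(w, 0.0) + 1.0 / (2 * m)
        key = frozenset((u, v))
        e[key] = e.get(key, 0.0) + 1.0 / (2 * m)
    while True:
        best_gain, best_pair = 1e-12, None
        for key, e_cd in e.items():
            c, d = sorted(key)
            gain = 2.0 * (e_cd - a[c] * a[d])   # change in Q if c and d merge
            if gain > best_gain:
                best_gain, best_pair = gain, (c, d)
        if best_pair is None:                   # no merge improves modularity
            break
        c, d = best_pair
        comm[c] |= comm.pop(d)
        a[c] += a.pop(d)
        for key in [k for k in e if d in k]:    # re-attach d's ties to c
            weight = e.pop(key)
            (other,) = key - {d}
            if other != c:
                new_key = frozenset((c, other))
                e[new_key] = e.get(new_key, 0.0) + weight
    return sorted(sorted(members) for members in comm.values())

# Toy graph: two triangles joined by the tie 2-3.
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
groups = greedy_modularity_communities(toy_edges)
```

On the toy graph the procedure stops with the two triangles as communities, because merging them would lower the modularity.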

Here nodes are not individual web users but overlapping cliques (groups in 
which all nodes are strongly connected to each other). However, each cluster has 
a specific relational structure and presents specific traits when compared to 
the structural characteristics of the whole network. The graph about general 
political news is (as expected) more heterogeneous, and we obtain seven 
different subgroups. These clusters emerge because the density of links among 
their nodes is higher than that detectable in the overall structure. Besides, 
there are strong structural differences among groups (Table 1). This means that 
there are seven different profiles of users, connected by a similar selection 
of news viewed. 

















Fig. 1 Collapsed seven groups on general political news (on the left side) and four groups on 
Senate reform (on the right side) 


Table 1 Degree, betweenness, and clustering (average values) for groups on 
political news issue 

Political news | | Clustering 
Tot (averages) | 756.27 | 0.84 
Group 1 | 1353.45 | 0.73 
Group 2 | 371.68 | 0.91 
Group 3 | 1063.77 | 0.79 
Group 4 | 102.57 | 0.96 
Group 5 | 824.47 | 0.85 
Group 6 | 406.00 | 0.00 | 1.00 
Group 7 | 


Table 2 Degree, betweenness, clustering (average values) for groups on 
senate reform issue 

Senate reforms | Degree | Betweenness | Clustering 
Tot (averages) | 517.21 | 515.99 | 0.85 
Group 1 | 411.98 | 748.35 | 0.81 
Group 2 | 605.00 | 184.02 | 0.91 
Group 3 | 626.41 | 568.31 | 0.80 
Group 4 | 412.20 | 113.50 | 0.98 

The first, second, and third groups are the most important, with both high 
degree (more articles viewed) and high betweenness (more brokers). The 
centrality measure distinguishing the most relevant subgroups is the 
betweenness for the first graph and the degree for the second (comparing 
averages in Tables 1 and 2). It is important to notice that selecting a more 
specific section of the whole graph is like applying a “magnifying glass” to the 
whole structure. Zooming in, we notice more clearly roles and information 
originally hidden in the graph. As a matter of fact, the intermediary function 
between “groups” is now visible through the betweenness centrality measure, 
while it was originally hidden by the strong overlap between multiple groups. 

NA tools can be useful to drive the decision-making process since they show the 
relational structure among web users. The network tools, indeed, allow 
detecting which news items lead to more or less homogeneous groups. The 
integration of this information with other sources, which also take into 
account comments and relational profiles, can be particularly useful to drive 
the decision-making process. It is also possible to carry out analyses that 
take groups, and not single subjects, as the reference unit. In this way, links 
within the groups are more or less dense according to the common importance 
attributed to specific information. This behaviour cannot be observed by 
selecting only individual elements. One piece of information we obtain concerns 
the diameter and the shortest paths in the graph, that is, the number or the 
kind of items that connect all people within the net. Another piece of 
information concerns the identification of central or peripheral subgroups 
with respect to the whole net. 

This relational structure could mean that there are groups of readers who 
intercept multiple, cross-cutting information (there is not a specific kind of 
information: these readers simply select the more important news from time to 
time). This is the main feature of the first group, where the most informed 
nodes, with the highest betweenness measure, are the “attractors” of the 
network, with the structure of the graph depending on them. 

The configuration indicates that, within these groups, there are readers with a 
greater heterogeneity in cultural consumption, able to “connect” with very 
different contexts and arguments, acting as a structural “broker” among other 
(more selective) groups. Groups of this type have a thematically more varied 
configuration and they are more likely to consist of average informed users, who 
will tend to select the main information on differentiated issues. 

Instead, a network characterized by subgroups with higher degree centrality (as 
the second graph) is more homogeneous with respect to the selected items. In 
this case, few issues convey the interest. The highest degree value means that 
nodes are directly and strongly connected to each other (they often see the 
same news): readers are more specialized. 


4 Conclusions 


We proposed an innovative approach blending BD mining and NA methods, useful to 
identify positional (central vs peripheral) and relational (brokers vs 
isolated) structures of nodes or units within BD. We modelled a very large 
dataset containing all readers’ comments on articles published by a prominent 
Italian newspaper. We focused on the political debate around a proposed reform 
of a constitutional bill. The dataset at hand, besides being very large, is 
collected in real time, and this offers another interesting dimension for data 
analysis. This may turn into a very valuable tool for policy makers in general, 
as they can measure, in real time, readers’ reactions to new law proposals. As a 
consequence, such data processing can lead to very meaningful results, both to 
measure overall readers’ sentiment and to support the decision-making process. 

In our work, we mostly focused on users’ comments collected by the online 
newspaper. In the future, we will try to extend the dataset by including 
additional data such as users’ demographic data (if available for the 
registered users) and behavioural data, that is, data about interactions over 
time between each user and the newspaper’s contents. We believe that this 
additional type of data could bring some very valuable insights when combined 
with the commenting activity analysed in the present work, and that this should 
lead us to better and more refined models. 


Experimental Design Issues in Big Data: 
The Question of Bias 


Elena Pesce, Eva Riccomagno, and Henry P. Wynn 


Abstract Data can be collected in scientific studies via a controlled experiment or 
passive observation. Big data is often collected in a passive way, e.g. from social 
media. In studies of causation great efforts are made to guard against bias and 
hidden confounders or feedback which can destroy the identification of causation 
by corrupting or omitting counterfactuals (controls). Various solutions of these 
problems are discussed, including randomisation. 


Keywords Big data · Model bias · Experimental design · Nash equilibrium 


1 The Challenges of Experimental Design with Big Data 


The value of experimental design in physical and socio-medical fields is 
increasingly realised, but at the same time the systems under consideration are 
more complex. It may not be possible to do a carefully controlled experiment in 
many areas, while huge quantities of data are being produced, for example, from 
social media and web-based transactions. An added problem is that the traditions 
of experimental design differ. For example, in engineering design it will be 
possible to do a control experiment on a test bench, whereas in the 
socio-medical sciences the local counterfactual will be missing: we do not know 
how a particular patient would have fared if they had not been given the drug. 
Foundational work on these issues is by Rosenbaum and Rubin [11]. Roughly, the 
causal effect can only be measured on average, with great care taken about the 
background population, and with more reluctance than in the physical sciences 
to extend the conclusions outside the population under study. An old issue, 
which goes back into the history of science, is the distinction between active 
and passive observation. Is placing a sensor on a driverless car to collect 
data (for control) an intervention in the sense of the declaration that to 
prove causation you have to intervene? Despite these different historical 
traditions there seems to be general agreement (1) that deriving causal models 
is a kind of gold standard and (2) that to produce a causal model we need to 
guard against bias from different sources: hidden confounders, sampling bias, 
incomplete models, feedback and so on. 

We cover a few of the ideas from the theory of causation (Sect. 2) and then 
suggest that the double activity of building causal models while at the same 
time guarding against bias has features of a cooperative game (Sect. 3.1). At 
its simplest, a randomised clinical trial is a minimax solution to a game 
against the sources of bias. With this in mind we make the natural but 
speculative suggestion that we can import theories of Nash equilibrium, and we 
supply a simple example motivated by the theory of optimum experimental design 
under the heading of optimal bias design. We could have taken a Bayesian optimal 
design, for example, from [7, 13]. But for this short paper we felt it was 
enough to allow our randomness to come from the error distribution or the 
randomisation itself. 


2 Causal Models 


A major critique of passive analysis of the machine-learning type is the lack of 
attention to the building of causal models. We discuss briefly the main 
ingredients of causal graphical models and then the implications for 
experimental design [12]. 

A causal model is often described via a directed acyclic graph (DAG), 
$G(E, V)$, where each vertex $i \in V$ holds a (possibly vector) random variable 
$X_i$. Care has to be taken with the edges $i \to j$. The natural intuition that 
the edge means $X_i$ causes $X_j$ is not correct, at least not without much 
qualification. The DAG is a vehicle for describing all conditional independence 
structures. 

We can define a variable $X_j$ which is never observed as latent, or hidden. 
There is a slight difference: hidden may mean that we do not know whether the 
variable is there, although it might be. Latent may also be taken as expressing 
prior information. Thus a latent layer in a machine-learning context may be 
included to allow a more complex model, such as a mixture model. 

The conundrum with causal models stems from the distinction between passive 
observation and active experimental design. Experimental design is an 
intervention and there are essentially two types. First, we can simply apply 
some kind of treatment at node $i$ to obtain a special $X_i$, for example, give 
patient $i$ a drug. Second, and even more active, one can set variable $X_i$ to, 
say, high and low levels. 

Passive observation means that a joint sampling distribution covers all observed 
$X_i$. The act of setting should be thought of as advantageous in the sense that 
we are in some kind of classical or optimal design framework, but 
disadvantageous in that it is destructive. Roughly, setting $X_i$ destroys our 
ability to learn about the population from which $X_i$ comes. 

Consider a simple DAG: $X_1 \to X_2 \to X_3 \to X_4$ and for ease of 
explanation we write down a univariate linear version with obvious interpretation 

$$X_1 = \theta_0 + \epsilon_1, \qquad X_2 = \theta_1 X_1 + \epsilon_2,$$
$$X_3 = \theta_2 X_2 + \epsilon_3, \qquad X_4 = \theta_3 X_3 + \epsilon_4,$$

where $\{\epsilon_i\}_{i=1,\ldots,4}$ are error variables. Suppose we are 
interested in the last causal parameter $\theta_3$. The ideal would be to carry 
out a controlled experiment, setting the levels of $X_3$ and observing $X_4$. 
The first assumption to make is governed by the following: 


Principle 1 The distribution of $X_4$ conditional on a set value of $X_3$ is 
the same as when the same value of $X_3$ was passively observed. 

There are arguments to justify this but it remains a most important assumption. 
We can also passively observe $X_1, X_2, X_3, X_4$. Note that the model is 
nonlinear in the parameters, as 
$X_4 = \theta_0\theta_1\theta_2\theta_3 + \theta_1\theta_2\theta_3\epsilon_1 + \theta_2\theta_3\epsilon_2 + \theta_3\epsilon_3 + \epsilon_4$, 
and also that $X_4$ is Gaussian if the $\{\epsilon_i\}$ are Gaussian. One may 
not have to choose between a controlled experiment and passive observation. 
This leads to another principle, see [6]. 


Principle 2 A mixture of passive observation and active experimentation may be 
optimal. 
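
To make the linear chain concrete, the following simulation sketch compares passive observation with a controlled experiment in which the levels of the third variable are set. The parameter values, the unit-variance Gaussian errors, the sample size and the function names are all illustrative assumptions and do not come from the text.

```python
import numpy as np

def simulate_chain(theta, n, rng, set_x3=None):
    """Draw n samples from X1 -> X2 -> X3 -> X4 with unit-variance Gaussian errors.

    theta = (theta0, theta1, theta2, theta3). If set_x3 is given, X3 is fixed
    by the experimenter at those levels instead of being generated from X2.
    """
    t0, t1, t2, t3 = theta
    eps = rng.normal(size=(4, n))
    x1 = t0 + eps[0]
    x2 = t1 * x1 + eps[1]
    x3 = t2 * x2 + eps[2] if set_x3 is None else np.asarray(set_x3, dtype=float)
    x4 = t3 * x3 + eps[3]
    return x1, x2, x3, x4

def slope(x, y):
    """Least squares slope of y on x (intercept included)."""
    return np.polyfit(x, y, 1)[0]

rng = np.random.default_rng(0)
theta = (1.0, 0.8, -0.5, 2.0)            # illustrative values only
n = 200_000
_, _, x3, x4 = simulate_chain(theta, n, rng)
passive = slope(x3, x4)                  # passive observation of X3 and X4
levels = np.tile([-1.0, 1.0], n // 2)    # controlled experiment: low and high
_, _, x3_set, x4_set = simulate_chain(theta, n, rng, set_x3=levels)
active = slope(x3_set, x4_set)
mean_x4 = x4.mean()                      # compare with theta0*theta1*theta2*theta3
```

In this setting, without any extra arrow into the last variable, both the passive and the active slope estimate the same causal parameter, and the passive mean of the last variable is close to the product of the four parameters, as in the expansion above.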

There is considerable discussion on how to learn DAG models with interventions, 
and controlled experiments are a form of intervention. Most effort has been put 
into identifiability; see [4] for a review. In our example suppose there is an 
extra arrow $X_1 \to X_4$. Such an arrow is referred to as a backdoor. If the 
index is time we can say that there is another path from $X_1$ into the future 
in addition to $X_1 \to X_2 \to X_3 \to X_4$: 

$$X_1 \to X_2 \to X_3 \to X_4, \qquad X_1 \to X_4.$$


Now if we fix $X_3$ we cannot so simply estimate $\theta_3$ because the 
distribution of $X_4$ is corrupted by the new path. In the observational case, 
we have another parameter and the changed equation 

$$X_4 = \theta_3 X_3 + \theta_4 X_1 + \epsilon_4. \tag{1}$$


There are now too many parameters for the observations (even with replication). 
The celebrated backdoor theorem due to [10] tells us how to obtain 
identifiability. Suppose you want to see whether $X_i$ causes $X_j$; then we 
need two conditions for a good conditioning set of variables $S$: 


1. No node (variable) in $S$ is a descendant of $X_i$. 
2. $S$ blocks every (backdoor) path from $X_i$ to $X_j$ that has an arrow into $X_i$. 


This theorem tells us: (1) whether there is confounding given this DAG, (2) 
whether it is possible to remove the confounding and (3) which variables to 
condition on to eliminate the confounding. For example, if we are trying to 
establish the effect of $X_3$ on $X_4$, then we must observe, or set and 
condition on, any $X_i$ which is not a descendant of $X_3$ and blocks all paths 
from, in our case, ancestors of $X_3$. In addition, if there are any other 
downstream (future) variables such as an extra $X_5$ with $X_4 \to X_5$, then 
$X_5$ will not interfere with our causal analysis; we can forget it. In summary: 


Principle 3 Guard against effects from nuisance confounders by suitable 
additional conditioning. 
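
The effect of the extra arrow and of conditioning can be illustrated by simulation. The sketch below generates data from the linear chain with the backdoor term of Eq. (1) and compares the least squares coefficient of the third variable with and without conditioning. The parameter values, unit-variance Gaussian errors and function name are assumptions made for illustration.

```python
import numpy as np

def coef_on_x3(x4, x3, *controls):
    """Coefficient of X3 in a least squares fit of X4 on X3, controls and intercept."""
    design = np.column_stack([x3, *controls, np.ones_like(x3)])
    beta, *_ = np.linalg.lstsq(design, x4, rcond=None)
    return beta[0]

rng = np.random.default_rng(1)
n = 200_000
t0, t1, t2, t3, t4 = 1.0, 0.8, -0.5, 2.0, 1.5   # illustrative values only
e1, e2, e3, e4 = rng.normal(size=(4, n))
x1 = t0 + e1
x2 = t1 * x1 + e2
x3 = t2 * x2 + e3
x4 = t3 * x3 + t4 * x1 + e4          # Eq. (1): extra arrow X1 -> X4

naive = coef_on_x3(x4, x3)           # no conditioning: confounded through X1
adj_x1 = coef_on_x3(x4, x3, x1)      # conditioning set S = {X1}
adj_x2 = coef_on_x3(x4, x3, x2)      # S = {X2} also blocks X3 <- X2 <- X1 -> X4
```

With these assumed values the unadjusted coefficient is visibly pulled away from the causal parameter, while either conditioning set recovers it up to sampling error. Both sets satisfy the two conditions listed above, since neither variable is a descendant of the third one.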


3 Bias Models 


Before presenting our contribution, we briefly review relevant literature. For 
the model without bias 

$$E[Y_i] = \theta^T f(x_i) \tag{2}$$

with $\theta \in \mathbb{R}^p$, $x_i \in \mathcal{X}$ for $i = 1, \ldots, n$, 
and the usual assumptions on the random error, [3] and [15] propose 
information-based and sequential algorithms (also response adaptive in [3]) for 
the selection of a subsample from a large, or possibly big, dataset. They 
provide an optimal subsample with respect to a chosen utility function. 

Bias models and optimal design of experiments were considered by Montepiedra 
and Fedorov [9] and recently in the context of big data by Wiens [16]. Those 
authors add a bias term $\phi^T h$ to the model (2) and thus study 
$E[Y_i] = \theta^T f(x_i) + \phi^T h(x_i)$. They search for a design which 
minimises the mean square error of the least squares estimator of the $\theta$ 
parameters, guarding $\theta$ from the bias term. In particular, [16] proposes a 
theory of minimax I- and D-robust designs as subsets of a large finite set of 
points, while [9] proves results for a design to be optimal when the effect of 
the bias term is bounded above by a given constant and below by zero. 

The conditioning argument of the backdoor theorem is a way of avoiding biases. 
In the above example in Eq. (1), $\theta_4$ gives a bias. Enough conditioning 
creates a kind of laboratory inside which we can conduct our experiment by 
setting the level of $X_3$. Sometimes this is referred to as creating a Markov 
blanket. But there are sources of bias which either we do not know at all or 
have some ideas about but are too costly to control. Biases range from those we 
really know about but simply do not observe to those which are introduced to 
model additional variability. This will affect the overall distribution of the 
observed variables, in a way similar to classical factor analysis. 


Principle 4 Special models are needed to ascertain and guard against hidden 
sources of bias, for example, using randomisation or latent variable methods. 


We build on the ideas in [9] and discuss in detail how optimal experimental 
design can guard against hidden sources of bias, indicated below with the letter 
$z$. Thus consider a two part model in which the first part is the causal model 
of main interest with parameters $\theta$ and the second part is the bias term 
with parameters $\phi$. This separation is familiar from traditional 
experimental design where $\theta$ and $\phi$ might be treatment and block 
parameters, respectively [1, 9]. The model is 

$$Y_i = \theta^T f(x_i) + \phi^T g(z_i) + \epsilon_i, \tag{3}$$

where the $\epsilon_i$ are independent and have equal variance $\sigma^2$. 

We want to protect the usual least squares estimator, $\hat\theta$, obtained 
from the reduced model in Eq. (2) ignoring the bias term $\phi^T g(z_i)$. Define 
the full moment matrix by 

$$M = \int \begin{pmatrix} f(x) \\ g(z) \end{pmatrix} \begin{pmatrix} f(x)^T, & g(z)^T \end{pmatrix} \xi_{x,z}(dx, dz) = \begin{pmatrix} M_{11} & M_{12} \\ M_{21} & M_{22} \end{pmatrix},$$

where $\xi_{x,z}$ is the experimental design measure over $(x, z)$-space. Then 
the mean squared error (MSE) matrix can be written as 

$$E\{(\hat\theta - \theta)(\hat\theta - \theta)^T\} = \sigma^2 N^{-1} R,$$

where 

$$R = M_{11}^{-1} + M_{11}^{-1} M_{12} \psi \psi^T M_{21} M_{11}^{-1} = S_1 + S_2$$

with $\psi = \sqrt{N}\phi/\sigma$ the standardised bias parameter and $N$ the 
sample size (see [9]). 

Well-known criteria for optimality ask to minimise, over the choice of 
experimental design, the quantity 
$\mathrm{trace}(R) = \mathrm{trace}(S_1) + \mathrm{trace}(S_2)$ (the trace 
criterion or A-optimality) or 
$\det(R) = \det(S_1)(1 + \psi^T M_{21} M_{11}^{-1} M_{12} \psi)$ (the 
D-optimality criterion). 
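
The quantities above are easy to evaluate numerically for a finite design. The sketch below is a minimal illustration under our own assumptions: the function name, the argument layout (model rows stacked in arrays F and G, design weights w) and the toy design are hypothetical. It computes the two parts of R, the trace criterion and the determinant criterion in its factorised form.

```python
import numpy as np

def bias_design_criteria(F, G, w, psi):
    """Return S1, S2, trace(R) and det(R) for a design with weights w.

    F: (n, p) array with rows f(x_i)^T; G: (n, q) array with rows g(z_i)^T;
    w: (n,) design weights summing to one; psi: (q,) standardised bias vector.
    """
    M11 = F.T @ (w[:, None] * F)
    M12 = F.T @ (w[:, None] * G)
    S1 = np.linalg.inv(M11)
    u = S1 @ M12 @ psi                   # M11^{-1} M12 psi
    S2 = np.outer(u, u)
    a_crit = np.trace(S1) + np.trace(S2)
    d_crit = np.linalg.det(S1) * (1.0 + psi @ M12.T @ S1 @ M12 @ psi)
    return S1, S2, a_crit, d_crit

# Hypothetical toy design: straight line in x, one scalar confounder z.
rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=20)
z = rng.normal(size=20)
F = np.column_stack([np.ones_like(x), x])
G = z[:, None]
w = np.full(20, 1.0 / 20)
psi = np.array([1.0])
S1, S2, a_crit, d_crit = bias_design_criteria(F, G, w, psi)
```

The factorised determinant can be checked against a direct evaluation of the determinant of the sum of the two parts, which is a useful sanity test when experimenting with other designs.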

The design problem is easier when the design space and design $D$ are direct 
products and thus can be written as 

$$\mathcal{X} \times \mathcal{Z}, \qquad D = D_1 \times D_2 \tag{4}$$

with $x \in \mathcal{X}$ and $z \in \mathcal{Z}$; $D_1$ and $D_2$ are finite 
subsets of, respectively, $\mathcal{X}$ and $\mathcal{Z}$. Then, 
$\mathrm{trace}(R)$ includes a term which depends only on $D_1$; likewise a 
factor in $\det(R)$ depends only on $D_1$. 

The most familiar example is from clinical trials where one compares a treatment 
against a control. Consider the simple case 

$$Y_{1i} = \theta_1 + \theta_2 + \phi(z_{1i} - \bar z) + \epsilon_{1i}$$
$$Y_{2i} = \theta_1 - \theta_2 + \phi(z_{2i} - \bar z) + \epsilon_{2i},$$

where the $z_{ji}$ are unwanted confounders which may be a source of bias, 
$\bar z$ is the grand mean and $n = N/2$ points are allocated to each group. 
Adapting the above analysis we obtain 

$$M = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & (\bar z_1 - \bar z_2)/2 \\ 0 & (\bar z_1 - \bar z_2)/2 & s \end{pmatrix},$$

where the $\bar z_i$, $i = 1, 2$, terms are the group means and 
$Ns = \sum_{i=1}^n (z_{1i} - \bar z)^2 + \sum_{i=1}^n (z_{2i} - \bar z)^2$. 
The bias term is $\mathrm{trace}(S_2) = \psi^2 (\bar z_1 - \bar z_2)^2/4$, which 
is zero when $\bar z_1 = \bar z_2$. This is the simplest case of balance and 
extends easily to multivariate $z$. A number of methods of achieving balance 
have been studied, each of which can be cast in the above framework: 


1. Stratification: balancing in each stratum and then aggregating the difference. 

2. Distance methods: pairing up treatment and control units which are close in 
z-space with respect to some distance such as Mahalanobis distance [8]. 

3. Propensity score. This much researched method seeks to balance in such a way 
as to ensure that the bias correction is extended to a larger parent population 
[11, 12]. Some adaptation of the above analysis is possible in this case. 
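
The balance term of the two-group example can be evaluated directly for different allocations. The sketch below is an illustration under our own assumptions, with a single simulated scalar confounder and hypothetical function names. It compares a deliberately unbalanced split, a simple one-dimensional version of the distance method (pairing neighbours in the confounder and randomising within pairs) and plain randomisation.

```python
import numpy as np

def bias_term(z_treat, z_ctrl, psi=1.0):
    """psi^2 (mean(z_treat) - mean(z_ctrl))^2 / 4 for two equal-sized groups."""
    return psi ** 2 * (np.mean(z_treat) - np.mean(z_ctrl)) ** 2 / 4.0

def matched_pair_allocation(z, rng):
    """Pair units adjacent in sorted z, then randomise treatment within each pair."""
    pairs = np.argsort(z).reshape(-1, 2)          # needs an even number of units
    flip = rng.integers(0, 2, size=len(pairs)).astype(bool)
    treat = np.where(flip, pairs[:, 1], pairs[:, 0])
    ctrl = np.where(flip, pairs[:, 0], pairs[:, 1])
    return treat, ctrl

rng = np.random.default_rng(3)
z = rng.normal(size=200)                           # hypothetical scalar confounder
order = np.argsort(z)
worst = bias_term(z[order[100:]], z[order[:100]])  # treatment = largest z values
treat, ctrl = matched_pair_allocation(z, rng)
matched = bias_term(z[treat], z[ctrl])
perm = rng.permutation(200)
randomised = bias_term(z[perm[:100]], z[perm[100:]])
```

The comparison only concerns the balance term. It says nothing about the variance part of the criterion, which depends on the causal model design.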


3.1 A Game Theoretic Approach 


For ease of explanation we introduce two players: Alice (A) and Bob (B). Alice 
selects a causal model design $D_1$ using $\{\theta, f\}$ and Bob selects 
design $D_2$ using $\{\phi, g\}$. In the product case (4), Alice and Bob can 
operate separately. In other cases they may cooperate fully to find the best 
design over the design space for the pair $(x, z)$. However, there is another 
possibility, namely to use a Nash equilibrium approach [2, 5, 14]. 

For two players A and B with composite cost functions $C_1(u, v)$, $C_2(u, v)$ 
and solutions $u^*$, $v^*$, at equilibrium it holds that 

$$\text{Alice}: \quad u^* = \arg\min_u C_1(u, v^*)$$
$$\text{Bob}: \quad v^* = \arg\min_v C_2(u^*, v).$$

We illustrate the presence of Nash equilibrium in the causation-bias setup by a 
simple example. We take a distorted design space, but still a product-type 
design measure. Thus, let the model be 

$$E(Y) = \theta_0 + \theta_1 x + \phi z$$

and let the design have four support points (we put the design measure in the 
second line): 

$$\left\{ \begin{matrix} (1, 1), & (0, 1), & (0, -1), & (-1, -1) \\ \alpha\beta, & (1-\alpha)\beta, & \alpha(1-\beta), & (1-\alpha)(1-\beta) \end{matrix} \right\}$$

where $0 < \alpha, \beta < 1$. Since, in this case, $M_{12}$ is a $2 \times 1$ 
column vector: 

$$\mathrm{trace}(S_2) = \psi^2 M_{21} M_{11}^{-2} M_{12}.$$

The equilibrium takes the form: 

$$\text{Alice}: \quad \alpha^* = \arg\min_\alpha \mathrm{trace}(S_1)$$
$$\text{Bob}: \quad \beta^* = \arg\min_\beta \mathrm{trace}(S_2).$$

There are two Nash equilibria given by solving 
$\frac{\partial}{\partial\alpha}\mathrm{trace}(S_1) = \frac{\partial}{\partial\beta}\mathrm{trace}(S_2) = 0$. 
This gives two solutions $(\alpha^*, \beta^*)$ and $(\frac{1}{2}, \frac{1}{2})$ 
with $\alpha^* = 0.59306$ and $\beta^* = 0.08274$ computed numerically. Note 
that neither solution depends on $\psi$, and in fact scale invariance of this 
kind is a well-known feature of Nash equilibrium. 

We can compare the solution with an overall optimisation by setting $\psi = 1$ 
and minimising $\mathrm{trace}(S_1) + \mathrm{trace}(S_2)$. The minimum is 4, 
achieved at $(\alpha, \beta) = (\frac{1}{2}, \frac{1}{2})$ with 
$(\mathrm{trace}(S_1), \mathrm{trace}(S_2)) = (3, 1)$, whereas at 
$(\alpha^*, \beta^*)$ the value of $\mathrm{trace}(S_1) + \mathrm{trace}(S_2)$ 
is approximately 5.1735 with 
$(\mathrm{trace}(S_1), \mathrm{trace}(S_2)) = (4.483, 0.6905)$. 
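
The numbers quoted in this example can be checked with a short numerical sketch. It rests on one assumption of ours: the four support points are taken to be (1, 1), (0, 1), (0, -1) and (-1, -1), which is the reading consistent with the stated values of the two traces. The function names are hypothetical, and the partial derivatives are approximated by central finite differences rather than computed in closed form.

```python
import numpy as np

# Support points (x, z); the first point (1, 1) is an assumed reconstruction.
POINTS = np.array([[1.0, 1.0], [0.0, 1.0], [0.0, -1.0], [-1.0, -1.0]])

def weights(alpha, beta):
    """Product-type design measure on the four support points."""
    return np.array([alpha * beta, (1 - alpha) * beta,
                     alpha * (1 - beta), (1 - alpha) * (1 - beta)])

def traces(alpha, beta, psi=1.0):
    """Return (trace(S1), trace(S2)) for the model theta0 + theta1*x + phi*z."""
    w = weights(alpha, beta)
    x, z = POINTS[:, 0], POINTS[:, 1]
    F = np.column_stack([np.ones_like(x), x])    # f(x) = (1, x)
    M11 = F.T @ (w[:, None] * F)
    M12 = F.T @ (w * z)                          # 2 x 1 column, stored as a vector
    M11_inv = np.linalg.inv(M11)
    u = M11_inv @ M12
    return float(np.trace(M11_inv)), psi ** 2 * float(u @ u)

def partials(alpha, beta, h=1e-6):
    """Finite-difference d trace(S1)/d alpha and d trace(S2)/d beta."""
    d_alpha = (traces(alpha + h, beta)[0] - traces(alpha - h, beta)[0]) / (2 * h)
    d_beta = (traces(alpha, beta + h)[1] - traces(alpha, beta - h)[1]) / (2 * h)
    return d_alpha, d_beta

centre = traces(0.5, 0.5)
nash = traces(0.59306, 0.08274)
grad_centre = partials(0.5, 0.5)
grad_nash = partials(0.59306, 0.08274)
```

Under this reading both stated solutions are stationary points of the two players' criteria up to the rounding of the quoted values. The sketch checks only this first-order condition and does not examine whether each point is a global minimiser for each player.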

Let us return to the role of Bob in our narrative. His experimental design decision 
will depend on his knowledge about the bias. For ease of explanation we reduce the 
argument to two canonical cases. 


Approach 1 Unknown $\psi$ 

$$\mathrm{trace}(S_2) = \mathrm{trace}\left(M_{11}^{-1} M_{12} \psi \psi^T M_{21} M_{11}^{-1}\right) = \psi^T Q_1 \psi, \qquad Q_1 = M_{21} M_{11}^{-2} M_{12}.$$

Under a restriction $\|\psi\| = 1$ this achieves a maximum at the maximum 
eigenvalue $\lambda_{\max}(Q_1)$. We can take this as our criterion, which is 
close to the E-optimality of optimum design theory. 

Approach 2 In Eq. (3), for unknown $\phi^T g(z) = h(z) \in \mathcal{H}$ in some 
function class, we have 

$$\|E(\hat\theta) - \theta\|^2 = h(z)^T Q_2 h(z), \qquad Q_2 = X_1 M_{11}^{-2} X_1^T,$$

where $X_1 = [f(x)]_{x \in D_1}$. We cannot optimise over $x$ ($X_1$) because, 
in our narrative, Alice needs it for the causal parameter $\theta$. A solution 
is then 

$$\min_{P_z} E_{P_z}\left\{ \max_{h \in \mathcal{H}} \left( h(z)^T Q_2 h(z) \right) \right\},$$

where $P_z$ is the randomisation distribution. In the language of game theory 
this is a mixed strategy to achieve a minimax solution. 
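
For Approach 1, the worst-case criterion can be computed for any finite design as the largest eigenvalue of a small symmetric matrix. The sketch below is illustrative only: the function name, the array layout and the toy design with two bias regressors are our own assumptions.

```python
import numpy as np

def worst_case_bias(F, G, w):
    """Return lambda_max(Q1) and Q1, where Q1 = M21 M11^{-2} M12.

    lambda_max(Q1) is the maximum of psi^T Q1 psi over unit vectors psi.
    F: (n, p) rows f(x_i)^T; G: (n, q) rows g(z_i)^T; w: (n,) design weights.
    """
    M11 = F.T @ (w[:, None] * F)
    M12 = F.T @ (w[:, None] * G)
    B = np.linalg.solve(M11, M12)    # M11^{-1} M12
    Q1 = B.T @ B                     # equals M21 M11^{-2} M12 as M11 is symmetric
    return float(np.linalg.eigvalsh(Q1)[-1]), Q1

# Hypothetical toy design: straight line in x, bias regressors z and z^2.
rng = np.random.default_rng(4)
x = rng.uniform(-1.0, 1.0, size=30)
z = rng.normal(size=30)
F = np.column_stack([np.ones_like(x), x])
G = np.column_stack([z, z ** 2])
w = np.full(30, 1.0 / 30)
lam_max, Q1 = worst_case_bias(F, G, w)

# Any unit-norm bias direction gives a value no larger than lambda_max.
psi = rng.normal(size=2)
psi /= np.linalg.norm(psi)
value = float(psi @ Q1 @ psi)
```

Comparing this quantity across candidate designs for the confounder gives one possible way of ranking them when nothing is known about the direction of the bias.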

Randomisation has been heralded as one of the most important contributions of 
Statistics to scientific discovery. There are several arguments put forward for using 
randomisation: (1) it helps support assumptions of exchangeability in a Bayesian 
analysis, (2) it supports classical zero mean and equal variance arguments and (3) it 
produces roughly balanced samples. 


4 Conclusion 


After a discussion of some issues related to the use of experimental design to 
help establish causation in complex models, we study in a little more detail the 
use of optimal design methods to remove bias. In the standard case the causal 
part of a model can be estimated orthogonally from the bias. In more complex 
cases the problem can be set up as a cooperative game. We demonstrate the 
existence of Nash equilibria for a small example and point to a formulation 
which would include randomisation. This is preliminary work: establishing model 
classes (for example, special $h$'s, $\mathcal{H}$'s, $P_z$'s) and conditions 
on $D$ for which Approaches 1 and 2 can be turned into efficient algorithms is 
the object of current work. The general proposition is that such methods will 
help protect causal models against bias. 